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Abstract 

In this paper we characterize sharp time-data tradeoffs for optimization problems used for 
solving linear inverse problems. We focus on the minimization of a least-squares objective subject 
to a constraint defined as the sub-level set of a penalty function. We present a unified convergence 
analysis of the gradient projection algorithm applied to such problems. We sharply characterize 
the convergence rate associated with a wide variety of random measurement ensembles in terms 
of the number of measurements and structural complexity of the signal with respect to the chosen 
penalty function. The results apply to both convex and nonconvex constraints, demonstrating 
that a linear convergence rate is attainable even though the least squares objective is not strongly 
convex in these settings. When specialized to Gaussian measurements our results show that 
such linear convergence occurs when the number of measurements is merely 4 times the minimal 
number required to recover the desired signal at all (a.k.a. the phase transition). We also achieve 
a slower but geometric rate of convergence precisely above the phase transition point. Extensive 
numerical results suggest that the derived rates exactly match the empirical performance. 


1 Introduction 

In science and engineering today we gather data at unprecedented scales. Yet in many applications 
ranging from imaging to online advertisement and financial trading we are interested in making 
inference and predictions under a fixed time budget. Efficient learning from these large and high¬ 
dimensional datasets poses new challenges 

• What algorithms should we use under a fixed time budget? 

• How much of the data should we use? Should we use all of the data or just parts of it? 

• How many passes (or iterations) of the algorithm is required to get to an accurate solution? 

At the heart of answering these questions is the ability to predict runtime of optimization algorithms 
as a function of the required accuracy and the size of data. That is, we need to understand precise 
tradeoffs between run time, data size and accuracy of various optimization algorithms. 

In this paper we characterize sharp time-data tradeoffs for optimization problems used for 
solving linear inverse problems. The last decade has witnessed a flurry of activity in understanding 
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when and how it is possible to solve such problems. Based on this growing literature a clear picture 
has emerged as to when convex programming techniques are able to find a unique structured solution 
to under-determined linear systems with generic coefficients. Often a convex function is minimized 
subject to linear measurement constraints where the convex cost function captures the “structure” 
complexity of the unknown signal. The state of the art literature on this topic characterizes a 
structure-data complexity trade off, in which the number of linear measurements or data complexity 
is related to a quantity which represents the structural complexity of the signal. Indeed, there is a 
precise curve which characterizes the minimum number of measurements or data as a function of 
the structural complexity of the unknown solution. This curve is an asymptotic phase transition 
of sorts for convex programs, on one side of this curve convex programming succeeds in exactly 
recovering the unknown signal on the other side it fails with high probability. Thus providing a 
precise tradeoff between data complexity (number of measurements) and structural complexity. 

Despite the tremendous amount of progress on data-structure tradeoffs for convex programs 
much less is known about the role played by computation. There are, of course, various ways to 
solve a given convex program (e.g. interior point methods, gradient descent, etc). However, the rate 
of convergence of various convex solvers as well as how this rate depends on the data complexity 
(number of measurements) is not well understood. 1 Confounding the matter even further the 
convex cost functions used in linear inverse problems are often non-smooth and not strongly convex 
so that crude bounds on the rate of convergence of these solvers as predicted by traditional convex 
optimization literature are often very pessimistic and rather far from the empirically observed rates 
of convergence. 

Convex cost functions are not the only way to impose structure. Indeed in some applications it 
may be more appropriate to use non-convex cost functions or deploy greedy algorithms to accurately 
capture the structure of the unknown signal. This paper presents a unified theoretical framework 
for convergence rates of projected gradient methods for finding structured solutions to under¬ 
determined linear equations. Our theoretical framework is very general and covers projection on 
both convex and non-convex sets. 

For convex sets we precisely characterize rates of convergence of various projected gradient 
algorithms as a function of the ambient dimension, the number of linear measurements and the 
structural complexity of the solution set. We prove that our rates of convergence for these algo¬ 
rithms come with explicit constants that are rather sharp for linear inverse problems. Indeed, based 
on our detailed numerical results our rates are loose by at most a factor of 1.18. As a result our 
work suggests very precise time-data tradeoffs for linear inverse problems. 

We also characterize the rates of convergence of projected gradient algorithms to the unknown 
structured solution even when the structure is best characterized by a non-convex set. Examples 
include imposing sparsity via or £ p balls with p < 1 that are known to be non-convex. Establish¬ 
ing sharp convergence rates to the unknown solution for non-convex sets is particularly challenging. 
Indeed for non-convex sets (unlike convex sets) it is not even clear that such projected gradient 
algorithms should be able to find a sensible solution. Such algorithms originate from non-convex 
optimization problems and one may expect them to get trapped in local optima. Perhaps unexpect¬ 
edly, we show that this is not the case and even these non-convex projected gradient algorithms 
converge to the unknown signal. Furthermore, as in the convex case we provide sharp rates of 
convergence for these algorithms. 


1 We would like to point out that there are a few recent papers that address computational issues and we shall review 
this literature in greater detail in Section 4. 
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1.1 Structured signal recovery from linear measurements 

We wish to discern an unknown but “structured” signal x e IR n from m linear measurements of the 
form y = Ax. A e |R mxTl is the measurement matrix and y e IR m are the measurements. We wish to 
estimate the unknown signal x from such linear measurements. However, in the applications of our 
interest typically the number of equations m is significantly smaller than the number of variables n 
so that there are infinitely many solutions obeying the linear constraints. However, it may still be 
possible to recover the signal by exploiting knowledge of its “structure”. To this aim, let / : IR n -> IR 
be a cost function that reflects some notion of “complexity” of the “structured” solution. It is 
natural to use the following optimization problem to recover the signal. 

x = argmin f(z) subject to y - Ax. (1.1) 

zeR n 

In fact in practical applications there is also some noise in our measurements and we observe 
noisy samples 


y = Ax + w, (1.2) 

where w e IR m represents the noise. It is then natural to modify (1.1) and solve 

\\y - Az\\p 2 subject to f(z)<R , (1.3) 

where R is a tuning parameter. A standard approach to solve this problem is via projected gradient 
descent. In particular, define the set 

/C = {zeR n : f(z)<R}. 

Throughout we shall use Vk :(z) to denote the Euclidean projection of z onto the set 1C. That is, 

Vjc(z) = arg min \\z - z\\j . 

ztK 

Starting from an initial point zq (often set to 0). The Projected Gradient Descent (PGD) algorithm 
proceeds as follows 


z T+ i = Vk (z t + yA* (y - Az r )). (1.4) 

We note that without the projection step this iteration update is just gradient descent on the 
quadratic objective with learning parameter y. For the sake of exposition, throughout most of this 
paper we shall assume that the tuning parameter is optimally tuned so that R = f(x). We shall 
see later on in Section 2.3 how to relax this assumption. 

1.2 Precise rates of convergence using cone-restricted eigenvalues 

We wish to characterize the rates of convergence for the projected gradient update (1.4). Using no 
prior information, standard analysis suggests that this method will converge for convex / at a rate 
of 0(1/t) where r is the number of iterations. For non-convex functions /, it is not immediately 
clear that this iteration will converge at all. To get a sense of what properties of the function / 
and the structure of the signal affect the rate of convergence let us start with a simple example 
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where there are more measurements than unknowns. Also assume that the signal has no structure 
or equivalently f(z ) = 0 for all z. In this case the update takes the form 


z r+ i = z T + p.A*(y - Az t ). 


The latter is equivalent to 


z T+ 1 - x = (I - pA* A)(z t - x) - pA* w. (1.5) 

This formula implies the following naive convergence guarantee 

||z T+ i - x\\ e2 < || I - pA*A\\ \\z T - x\\ h + p || A\\ \\w\\ h . (1.6) 

This simple example suggests that the spectral norms il - pA* A|| and |A|| play an important 
role in the convergence of the problem. However, in the under-determined case these rates are 
rather naive. In fact, \\I - pA* A j| may no longer be smaller than one, so that one may expect the 
iterates to diverge. However, in the underdetermined and structured case the error z T - x is not 
an arbitrary vector. Therefore, even though the spectral norms of these matrices are large the gain 
of these iterates when acting on the error term z T - x may not be too large. This suggests that 
some form of singular values (perhaps restricted to a particular set) may still play a role even in 
the underdetermined case. To arrive at the right form of these sets and singular values we need a 
few definitions. 

Definition 1.1 (Descent set and cone) The set of descent of the function f at a point x is 
defined as 

V f (x) = {h: f(x + h)< /(®)|. 

The cone of descent is defined as a closed cone Cj(x) that contains the descent set, i.e. Vf(x) c 
Cf(x). The tangent cone is the conic hull of the descent set. That is, the smallest closed cone 
Cf(x ) obeying Vf(x) cCf(x). 

The set Vf(x) is the set of feasible (non-ascent) perturbations of /(•) at x and the descent cone 
C contains all such directions. The reason the descent cone plays an important role in our results 
is due to the fact that the errors in our iterates belong to the tangent cone. This suggests that we 
need to look at the singular values of the matrices I - pA* A and A restricted to this tangent cone. 
Indeed, the next theorem establishes such a result. 

Theorem 1.2 Let x be an arbitrary vector in IR n , f ■ R n R a proper function 2 , and Cf(x ) be 
a cone of descent of f at x and set C = Cf(x). Suppose A e [R mxn { s a Gaussian map and let 
y = Ax + w e IR m be m linear noisy samples. To estimate x, starting from a point z$, we apply the 
PGD update 


z T +1 = Vk.{z t + pA*(y - Az t )), 


2 A 


proper function is a function whose sub-level sets are closed. 
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with K. = {z € [R n : f(z) < f(x)}. Let Kf be a constant that is equal to 1 for convex f and equal to 
2 for non-convex f. Starting from the initial point z$ - 0 the update (1.4) obeys 

Nr - x \ t2 < («/ • p(p)Y H* a + N( A ) W w \\e 2 • ( L7 ) 

Here p(p) is the convergence rate defined as 

p(p) := p(p,A,f,x) = sup u* (I - pA* A) v, 

u,veCf(x)r\B n 

and A) is the noise amplification factor defined as 

'■= £u( A ’f’ x ’ w ) = L' sup v*A*—^—. 

vsCf(x)nB n II ^ II 

This theorem establishes a counter part to the simple results we mentioned for the overdetermined 
case (1.6). The only change is that the spectral norms of I - pA* A and A are replaced with 
restricted versions. Indeed, for properly chosen values of the learning parameter p the convergence 
rate p(p) can be significantly smaller than JJ -/iA*A||. In particular when p(p) is smaller than 
one (smaller than 1/2 in the non-convex case) the above theorem establishes a geometric rate of 
convergence even in the underdetermined case regardless of whether / is convex or non-convex. 

We would like to mention that our results is much more broadly applicable than quadratic 
cost functions g(z) = 7 \\y - Az\\^ 2 . Indeed, focusing on the noiseless case w = 0 and any twice 
continuously differentiable cost function g(z). Our results holds for this case as well with the 
convergence rate 

p(p) ■■= p(p,g,f,x) = sup u* (I - pV 2 g(z))v. 
u,veCf(x)nB n ,ze[R n 

We defer study of this more general case to a future publication. 

So far we have described the rate of convergence in terms of the cone-restricted eigenvalues. At 
this point however it is still not clear how the rate of convergence p(p) and the noise amplification 
factor £ m (A) precisely depends on the quantities of interest such as the number of measurements, 
the signal structure, the function / and the learning parameter. 

In this paper we shall focus on a variety of random ensembles for the measurement matrix A. 
For these random models what affects the convergence rate p(p) and noise amplification factor 
^(A) is the size of the tangent cone Cf(x). The reason this descent cone plays an important role 
in our results is due to the fact that the difference between the PGD iterates of (1.4) and the signal 
x lie in the tangent cone. As a result it is natural for the success of projected gradient descent to 
depend on the “size” of this set. The larger the descent cone the faster the rate of convergence. 
There are of course different ways to characterize the “size” of a set. We shall use the notion of 
Gaussian width defined below to characterize the size of the set. 

Definition 1.3 (Gaussian width) The Gaussian width of a set C e IR n is defined as: 

u(C) := E g [sup (g,z)], 

zeC 

where the expectation is taken over g ~ A/"(0, I n ). 


5 




We will use the Gaussian width of the intersection of the descent cone and the unit sphere (uj(Cf(x)n 
B n )) to quantify the “size” of the tangent cone. Interestingly, this parameter also characterizes the 
“complexity” of our unknown signal x and is often directly related to the number of parameters 
or degrees of freedom of the structured signal. For example, when using f(z) = || z\^ for a signal 
with s non-zero coefficients u> ~ 2slog(n/s). 

It is known that in the noiseless version of the problem (1.1) leads to exact recovery when the 
number of measurements is sufficiently large i.e. m > mo- This minimal number of measurements mo 
is roughly equal to square of the mean width of the tangent cone of / at x , i.e. mo ~ oj 2 ( Cf(x ) n £>“) 
[3,24], This simple formula provides a precise relationship between number of measurements (data 
complexity) and the structural complexity of the unknown signal. In the next section we will see 
that for random measurement matrices both the rate of convergence and the noise amplification 
factor can also be bounded rather precisely in terms of the number of measurements m and this 
notion of structural complexity (mo); providing precise tradeoffs between the rate of convergence, 
the number of measurements and the structural complexity of the signal. 

In Section 2 we develop sharp bounds for convergence rates of projected gradients. Then in 
Section 3 through extensive simulations we confirm that our theoretical bounds are rather sharp 
both in terms of sample and computational complexity. In Section 4 we review the vast amount of 
related literature on this topic. We discuss our results and some future directions in Section 5. We 
end the paper by proving our results in Section 6. 

2 Theoretical results for projected gradient descent 

In this section we shall explain our main theoretical results. More specifically we wish to precisely 
characterize the rate of convergence p(/i) and the noise amplification factor of Theorem 1.2 
for different random ensembles and different values of /r. Our focus will be on projected gradient 
descent (PGD). 

As we mentioned earlier we are interested in solving structured linear inverse problem via 
optimization of the form 

^ 111/ - Az\\g 2 subject to f(z)<R, (2.1) 

with R = f(x). As we shall see later in Section 2.3 this assumption can be relaxed. 

Naturally the convergence/lack of convergence as well as the rate of convergence of projected 
gradient descent depends on the learning parameter /r. A large value of the learning parameter 
will lead to faster convergence to the optimal solution. However, if the learning parameter is too 
large projected gradient may not converge to the optimal solution or may converge only if the 
number of measurements are large in comparison with the minimal number of measurements as 
dictated by the structural complexity of the unknown signal. On the other hand projected gradient 
descent may still converge with a smaller choice of learning parameter, albeit at a slower rate, 
even when the number of measurements is not too high. In short, for a fixed value of structural 
complexity we expect a tradeoff between computational and data complexity. Larger choices of the 
learning parameter increases the speed of convergence of the algorithm but require a higher data 
complexity (number of measurements) for convergence. Whereas smaller choices of the learning 
parameter are more efficient in terms of the data complexity but have slower convergence rates. To 
provide a rigorous understanding of such tradeoffs we study three different regimes for the learning 
parameters. 
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• A greedy choice of learning parameter of size of the order y « 1/m. This leads to a linear rate 
of convergence with a sufficiently large number of measurements. 

• A conservative choice of learning parameter of size of the order y ~ 1/re. The convergence 
rate in this case is geometric but achieves the best sample complexity (in fact this sample 
complexity is sharp for convex functions). 

• A structure dependent choice of learning parameter which depends on the structural com¬ 
plexity of the unknown signal. This choice offers a tradeoff between the previous two choices. 
It achieves a linear rate but at a reduced sample complexity compared to the greedy choice. 


2.1 Gaussian maps 


The main focus of this paper is on Gaussian maps. In particular we assume A e [R mxn has in¬ 
dependent A7(0,l) entries. We shall explain how our results can be generalized to non-Gaussian 
maps in Section 2.4. In all of our results we shall make use of the following definition for the phase 
transition function which characterizes the minimum required number of measurements. 


Definition 2.1 (phase transition function) Let x be an arbitrary vector in IR n , / : [R n -> IR be 
a proper function, and Cf(x ) be a cone of descent of f at x and set u = u(Cf(x) n £>"). Also let 


4>(t) = \/2 


W) 

r (|) 


\ft. We define the phase transition function as 
M(f,x,y) = 0 _1 (u; + 7?) * (w + r/)' 2 . 


We shall often use the short hand mo = A i(f,x,rj) with the dependence on f,x,rj implied. We note 
that for convex f based on [3, 24] mo is exactly the minimum number of measurements required 
for the program (2.1) to succeed in recovering the unknown signal x with high probability (in the 
absence of noise). With some overloading, even for non-convex f, we shall refer to mo as “phase 
transition” or “minimum number of measurements”. 


2.1.1 Linear rates of convergence via a greedy learning parameter 

In this section, we shall focus on a greedy learning parameter which is roughly of size y ~ 1/m. This 
greedy step size allows us to ensure a linear rate of convergence to the optimal solution with near 
minimal number of samples. In the theorem below and throughout we shall use the short-hand 

b m = 4>(m) = \/2 » y/fn. 

Theorem 2.2 Let x e IR n and w e IR m be arbitrary vectors and f : IR” -> IR be a proper function. 
Suppose A e IR mxn is a Gaussian map and let y = Ax + w e IR m be m linear noisy samples. To 
estimate x, starting from a point zq, we apply the PGD update 

z T+ i = V)c(z T + yA*(y- Az r )), (2.2) 

with 1C = {z e [R n : f(z) < f(x)}. Let Kf be a constant that is equal to 1 for convex f and equal to 
2 for non-convex f. Set the learning parameter to y = p- « —. Furthermore, let mo = A 4(f,x,rj) 
defined by 2.1 be the minimum number of measurements required by the phase transition curve. 
Then as long as 

m>8Kjmo, (2-3) 
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starting from the initial point zq = 0 the update (2.2) obeys 




( 8K 2 mo \ : 

m ) 


\X\\n + \ ~ - 

Wii v 2 . 


K / 


\A«o 


J 8k 2 ,^ m 
V / m 


to 


<?2 ’ 


with probability at least 1 - 9e 8 . 


(2.4) 


We would like to point out that for convex functions / in fact a sharper convergence result is possible. 
This follows from a more careful analysis which we detail in Section 6.3.5 (more specifically plugging 
/i = in (6.25)). That is, under the assumptions and setup of Theorem 2.2 for convex functions 
/ we can show that as long as 


7 + 3^5 

m >-mo ~ 6.85mo, 


then (2.4) can be replaced with 

7 + 3\/5 mo 


x h 2 < 


17 + 3\/5 mo \ ; 

\ 2 m / 


x 


^2 


+ Vl (3 + y5 ) . 


1 


\/rno 


W 


MU 


7+3\/5 rn 


ii ■ 


To parse Theorem 2.2 first let us consider the noiseless case where w = 0. The first interesting 
and perhaps surprising aspect of this result is its generality: It applies not only to convex cost 
functions but also to non-convex ones! As we mentioned earlier the optimization problem in (1.1) 
is not known to be tractable for non-convex /. Regardless, the theorem above shows that with a 
near minimal number of measurements, projected gradient descent provably converges to the global 
optimum of (1.1) without getting trapped in any local optima. 

We note that when condition (2.3) is satisfied the convergence rate of projected gradient descent 


(■y/fW^mo/m) is guaranteed to be strictly less than a constant smaller than one and therefore the 


algorithm will converge with a linear rate. In particular the number of iterations required by the al¬ 
gorithm to get within a relative error of e (\\z T - x\ i2 / ||z||^ 2 < e) is roughly of size 21og(^)/(log ^). 
This should be contrasted with the best known rates 1/e and l/\/e obtained by traditional results 
on non-snrooth convex optimization. 

As we mentioned earlier [3,24] characterized the precise number of measurements required for the 
optimization problem (1.1) to succeed when using a convex cost function /. This minimal number 
of measurements was a function of the structural complexity, described by the phase transition 
function mo = </>(w) « cu 2 . The above result shows a linear convergence rate when the number 
of measurements are larger than m > 6.85mo for convex functions and m > 32mo for non-convex 
cost functions. That is we get linear convergence to the global optimum when the number of 
measurements are a small constant factor times the phase transition. 

Another intriguing aspect of the above result is that it suggests an interesting tradeoff between 
sample complexity (number of measurements) and computational complexity. To see this note that 
the larger the number of measurements as compared with the phase transition function the faster 
the rate of convergence in (2.4). More precisely if m > 8c«; 2 mo the converge rate is of the form 


2 t - X \\ i2 < 




X 


h ■ 












Therefore, for fixed structural complexity more samples leads to a faster convergence rate which 
to some extent leads to a smaller computational complexity and vice versa. The further away we 
are from the phase transition the faster the convergence rate. A natural question at this point 
is whether the constant in front of the phase transition function in (2.3) is sharp. Naturally this 
constant depends on the learning parameter. We shall show in later sections that other choices 
of the learning parameter allow for smaller constants. This decrease in terms of data complexity 
however leads to slightly slower rates of convergence which in turn leads to higher computational 
complexity. Regardless, we would like to note that our numerical simulations in Section 3 indicate 
that for the choice of learning parameter /r r> ^ our results are a small constant away from the actual 
data complexity required by the algorithm. For example for promoting sparse structures via the 
i\ norm the minimal number of measurement required for the PGD update to converge is roughly 
around 5.8mo when the sparsity level s is equal to the signal dimension n. This is rather close to 
the prediction m > 6.85mo of (2.3). Interestingly the constant in front of is strictly larger than 

one. In fact, for fully dense signals (s = n) we can show p(l/m) ~ yj(y/2- 1 )~ 2 n/m ~ y/5.8n/m 
as n -> oo which demonstrates that the number of measurements must obey m > 5.8 n & 5.8mo to 
ensure p(l/m) < 1. This suggest that while the choice p « leads to a linear rate of convergence 
when m is a constant factor away from the phase transition, this choice may be too greedy for 
convergence slightly above the phase transition. In order to get convergence closer to the phase 
transition we shall study other choices for the learning parameter in the coming sections. 

Finally, in the presence of noise the algorithm converges to a solution which is within a small 
radius of the structured signal. This radius of convergence can be shown to be min-max optimal. 
That is, no algorithm can lead to a solution that is significantly closer to the structured solution. 
We also remark that the residual term in (2.4) has the exact same form (up to small and explicit 
constants) as the residual term one would get from analyzing (2.1) [14,72]. 

2.1.2 Geometric convergence above the phase transition via a conservative learning 
parameter 

In this section, we shall focus on a conservative learning parameter which is roughly of size fi ~ 1/n. 
This conservative step size allows us to ensure convergence with minimal amount of samples for 
convex functions. 

Theorem 2.3 Assume the same setup as Theorem 2.2. Furthermore, assume the penalty function 
f is convex and set the learning parameter to 


0.99 



Then as long as 


m > mo, 


the update (1.4) obeys 



— ZL_ 

with probability at least 1 - 2 e 2 - e~ T " with 7 a fixed numerical constant. 
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This theorem shows that exactly above the phase transition m > mo, projected gradient descent 
converges to within a small radius of the structured solution at a geometric rate. While the rate 
of convergence is geometric it is not linear. Indeed, the rate of convergence slightly above the 
phase transition is roughly of size 1 - O(f^) so that for a relative error of e, the required number of 
iterations is of the order of 2^ log (/ ) /(log^/). Comparing this result with that of the previous 
section we note that the computational complexity has increased by a factor of size roughly n/m. 

2.1.3 Structure dependent choice of learning parameter 

In the previous two sections we saw convergence results for two learning parameters. These learning 
parameters while dimension dependent did not depend in anyway on the structure of the unknown 
solution. A natural question is whether a choice of learning parameter that depends on the structure 
of the signal can lead to improved convergence results. This is the subject of the next theorem. 

Theorem 2.4 Assume the same setup as Theorem 2.2. Furthermore, assume the penalty function 
f is convex and set the learning parameter to 

_ 3 

m2- ^2 motyfn-(y/m-y/rm) 2 
h 2 m (m 0 - 2^/mmo + 2m) 


Then as long as 


m > 4mo, 


( 2 . 6 ) 


starting from zq = 0 the update (1.4) obeys 

M5))’ 


\\Zt-x \\ l2 < 




4 \Z2tt yfmf) 


w 


m 


h ’ 


(2.7) 


with probability at least 1 - 3e « - e 2 . Here, 


4>(t) = 2 


(>/ 2 (i 

(7-2T7 + 2) 2 


1 - a/47. 


As with our results in Theorem 2.2 the above theorem establishes a linear rate of convergence 
with near minimal number of measurements. We would also like to point out that similar to our 
previous results the rate of convergence in (2.7) exhibits the right behavior in the sense that the 
larger the number of measurements (the smaller 7), the faster the rate of convergence. However, 
the constant in front of the phase transition function in this theorem is sharper by a factor of 2. 
Our numerical results in Section 3 show that this constant is sharp. That is, with the learning 
parameter as chosen in (2.5) the minimal number of measurements required for the PGD update to 
converge to the actual solution is very close to 4mo which is precisely the value predicted by (2.6). 
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2.2 Lower bounds on the convergence rate 

We already explained that based on our numerical simulations in Section 3 it seems that using 
a learning parameter of size p ~ 1 /m, projected gradient descent will not converge slightly above 
the phase transition. Therefore, a more conservative choice of the learning parameter (as in this 
section) is required to guarantee converge in this “data poor” regime. A natural question at this 
point is whether it is possible to have a linear convergence rate using this conservative choice of 
learning parameter, specially in the “data poor” regime (slightly above the phase transition). While 
Theorem 2.3 above shows that a geometric rate is possible, the theorem below refutes the possibility 
of a faster linear rate when the learning parameter is on the of order 1 /n. 

Theorem 2.5 Consider the setup of Theorem 2.3 with w - 0 and assume that /(•) is convex. For 

2 

any e > 0 , there exists a radius r(e) such that, with probability l-exp(--^-), starting from any point 
zq satisfying \\zq -cc ||^ 2 < r(e), the PGD updates (1.4) obey 

\\z T -x\\ h > (1 - e) T • max (l -//(\/m + ^mo) 2 , 0 ) r \\z 0 - x\\ h . ( 2 . 8 ) 

When m > mo, setting the learning parameter to p = O(^) in (2.8) guarantees a lower bound on 
the convergence rate of size 1 - 0( —). This also suggests that, if one wishes to obtain an o(l) 
convergence rate, than the learning parameter should indeed be > —, which is consistent with the 
greedy choice of Theorem 2.2. 


2.3 Stability to tuning parameters 

So far we have assumed that the parameter R is tuned perfectly and is set to R = f(x). Of course 
in practice f(x ) is not known in advance. In this section we shall explain how our results on the 
performance of projected gradient descent change if R± f(x) in (1.3). For the sake of exposition we 
shall focus on the noiseless case and assume w = 0. As it will become clear in our proofs our results 
easily generalize to the noisy case. Also we shall only state our results in the setup of Theorem 2.2. 
All of our results easily generalize to the setup of other theorems. 


Theorem 2.6 Let x e IR n be an arbitrary vector in IR n and f : IR” -> IR be a proper function. 
Suppose A e [R mxn i s a Gaussian map and let y = Ax e IR m be m linear samples. To estimate x, 
starting from a point zq, we apply the PGD update (1.4) with 1C = {z e IR n : /(z) < R}. Let Kf 
be a constant that is equal to 1 for convex f and equal to 2 for non-convex f. Set the learning 
parameter to p = p- » Furthermore, let mo = A4 (f,x,y) defined by 2.1 be the minimum number 
of measurements required by the phase transition curve. Then as long as 

m>8Kjmo, (2-9) 

starting from the initial point zq = 0 the update (1.4) obeys the following condition for all R < f(x), 


x 


k * P " 


II X Wi2 + 




1 ~ P 


x ■ 


'Pk(x) \\ i2 , 


with 



( 2 . 10 ) 


_ ZL_ 

with probability at least 1 - 8 e s . Furthermore, if f is absolutely homogenous starting from the 
initial point zq = 0 for all R > f(x) the update (1.4) obeys 


z T - a;||^ 2 < p T \\x 


+ 


Kf + 1 + 2p / R \ 

1 ~P [ /(*) / 



with 



( 2 . 11 ) 
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with probability at least 1 - 8e « . i/ere m' 0 - M.(f, v) 3 

Comparing Theorem 2.6 with its counterpart in Theorem 2.2 we see that there is an extra term 
due to the mismatch between the tuning parameter R and /(*)• This extra “mismatch error” goes 
to zero as R -»■ f(x). This is of course natural as if we do not know R precisely we can not hope 
to recover the signal x exactly. In fact we would like to note that this mismatch term takes a near 
optimal form. In particular when R < f(x) one can not hope to recover a signal that is closer 
to x than Vfc(x). Therefore, the smallest error one can hope for is \\x - Vjc{x)\\g . Interestingly, 
(2.10) shows that one can get to this near minimal error via projected gradient descent. Another 
interesting aspect of the theorem above is that the mismatch error terms exhibit the correct behavior 
in terms of the data complexity. Larger number of measurements lead to smaller values of the rate 
p which in turn lead to a smaller mismatch error. 

2.4 Non-Gaussian maps 

Our results are not restricted to Gaussian ensembles and apply to a wide variety of random mea¬ 
surements. We shall state our results for non-Gaussian maps in the setup of Theorem 2.2 and in 
the noiseless case (w = 0). All of our results easily generalize to the setup of other theorems as well 
as noisy measurements. We will discuss two general class of random ensembles. One general class 
of ensembles are isotropic sub-Gaussian matrices. 

Definition 2.7 (Isotropic Sub-Gaussian (ISG) matrices) A e [R mxn { s a A-subgaussian ma¬ 
trix if its rows are independent of each other and for all 1 <i <m the ith row a* satisfies 

• E[a.j] = 0. 

. E [ ai a*]=I n . * 4 

• For any vector v, P(|a*u| > t |MI^ 2 ) ^ exp(-^). 

Theorem 2.8 Let x be an arbitrary vector in P n , / : P n -> P a proper function. Suppose A e P mxn 
is an Isotropic sub-Gaussian map with parameter 5 and let y = Ax e P m be m linear samples. To 
estimate x, starting from a point z$, we apply the PGD update (1.4) with K = {z e P n : f(z) < 
/(*)}. Set the learning parameter to y= Furthermore, let mo = M(f,x,rj) be defined by 2.1 be 
the minimum number of measurements required by the phase transition curve. Also assume that 


m > ca • mo, 


( 2 . 12 ) 


for a fixed numerical constant ca depending only on A. Then, there exists a constant c\ > 0 and an 


event of probability at least 1 - e ClV 
the update (1.4) obeys 


such that on this event, starting from the initial point zq = 0 


z r - x 


ti 


( m 0 \ ■ 
c A — 

V ml 


\\x\ 


ti ■ 


(2.13) 


,? Note that if / is a norm, m' 0 = mo. 

4 The assumption that the population covariance matrix is identity is not necessary. In fact our result generalizes 
to any covariance matrix. The only change is that the required number of measurements will now scale with the 
condition number of the covariance matrix. 
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Results with sub-Gaussian ensembles provide useful theoretical insights. In many practical appli¬ 
cations other measurement ensembles may be of interest. Indeed, measurements that are physically 
realizable in most signal processing applications (e.g. MRI) are based on Fourier ensembles. Also, 
sub-Gaussian ensembles are dense and do not have fast multiplication. So it is desirable to have 
measurement ensembles where applying A and its transpose can be implemented with a fast algo¬ 
rithm (e.g. applying the discrete Fourier transform via the FFT algorithm). Our results also apply 
to a wide variety of ensembles that are physically implementable and have fast multiplication. 

Definition 2.9 (Subsampled Orthogonal with Random Sign (SORS) matrices) Let F e 

IRnxn d eno t e an orthonormal matrix obeying 

F*F = I and max|FR| < ——. (2-14) 

id \Jn 

Define the random subsampled matrix H e [R mxn with i.i.d. rows chosen uniformly at random from 
the rows of F. Now we define the Subsampled Orthogonal with Random Sign (SORS) measurement 
ensemble as A = HD, where D e |R rixn is a random diagonal matrix with the diagonal entries 
i.i.d. ±1 with equal probability. 

To simplify exposition, in the definition above we have focused on SORS matrices based on sub¬ 
sampled orthonormal matrices H with i.i.d. rows chosen uniformly at random from the rows of an 
orthonormal matrix F obeying (2.14). However, our results continue to hold for SORS matrices 
defined via a much broader class of random matrices H with i.i.d. rows chosen according to a prob¬ 
ability measure on Bounded Orthonormal Systems (BOS). Please see [43, Section 12.1] for further 
details on such ensembles. 

Theorem 2.10 Consider the same setup as Theorem 2.2 using a Subsampled Orthogonal with 
Random Sign (SORS) matrix. Also assume that 

m > ca(t] + l) 2 ■ (logn) 4 ■ mo, (2.15) 

for a fixed numerical constant ca depending only on A. Then, there is an event of probability at 
-a 2 

least l-4e s , such that on this event, starting from the initial point zq = 0 the update (1.4) obeys 

r 

\\z t -x \\ £2 < ^c A (r/+ l) 2 ^ log 4 nj 2 ||*||^ . (2.16) 

Theorems 2.8 and 2.10 above extend Theorem 2.2 to the class of ISG and SORS matrices albeit at 
a loss in terms of the constants (log factors). We would like mention that our theoretical results 
apply to a more general class of matrices. Indeed as it will become clear in the proofs (Sections 6.5 
and 6.6) (2.15) and (2.16) continue to hold for any matrix obeying the following isometry property. 

max11| Av||\ - ||u||» | < A, (2-17) 

veS 2 2 

where 

5 = C_UC+, with C- = Cf(x) n S n_1 -Cf(x) n S n_1 , C + = C f (x) n S™' 1 + C f (x) n S n "\ 

and the columns of A are properly normalized to have Euclidean norm equal to one. We emphasize 
that finding maps satisfying (2.17) is an active and exciting area of research. 
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Figure 1: Phase transition phenomenon for projected gradient iterations. 

This diagram shows the empirical probability that G-projected gradient with y = 1/m 
successfully identifies a vector x e |R n=100 ° with s nonzero entries from m random 
measurements of the form y = Ax. Here, A e R mxn is a Gaussian matrix with entries 
i.i.d. A^(0,1). The colormap tapers between red and blue where red represents certain 
success, while blue represents certain failure. 


3 Numerical results 

In this section we corroborate the theoretical findings of the previous sections via experiments on 
synthetic data as well as a few experiments on natural images. We begin with some synthetic 
experiments. 

3.1 Synthetic simulations 

In the simulations of this section we focus on sparse recovery problems using various cost functions 
/. We run our simulations with signals of fixed dimension n = 1000. We however vary the sparsity 
level s, number of measurements m, and the learning parameter y. In our experiments we also 
use two different random ensembles for the sensing matrix A e [R mxn : Gaussian with i.i.d. entries 
1) and Bernoulli with i.i.d. entries taking values 0 and 1 with equal probability. 

3.1.1 Phase transitions 

The aim of the simulations of this section is to explore how the phase transition curve changes with 
different functions / and different learning parameters y. We use random signals x € IR n and random 
sensing matrices A e IR rnxn . The support of the signal is of size s and is chosen at random with 
the values on the support distributed i.i.d. A/"(0,1). We use y = Ax e IR m as our measurements and 
apply the PGD update with various learning parameters y starting from z o = 0. We stop after 500 
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Figure 2: Phase transition curve with a greedy learning parameter. 

This diagram shows the empirical phase transition curve for ^-projected gradient with 
fi = 1/m for successful recovery of a vector x e |R n=100 ° with s nonzero entries from m 
random measurements of the form y = Ax. Here, A e |R mXTl is a Gaussian or Bernouli 
random matrix. We demonstrate results using £o>^i/ 2 > and t\. We also draw the 
prediction of Theorem 2.2 for G-PGD for comparison. We see that the prediction of 
Theorem 2.2 is accurate up to a multiplicative factor of 1.18. 


iterations and record the empirical probability of success. The empirical probability of success is an 
average over 50 trials, where in each instance, we generate new random signals and measurement 
matrices. We declare a trial successful if the relative error of the reconstruction ||® - ®||^ / ||*||^ 
falls below 1CT 3 . 

Figure 1 depicts the empirical success probabilities via a color map for different sparsity levels 
s and number of measurements m. Red represents certain success, while blue represents certain 
failure. In the experiments of this figure the measurements are Gaussians, the objective function is 
f(z) = ||z||£ and the learning parameter is set to y = 1/m. This curve clearly shows that there is a 
phase transition curve for the number of measurements as a function of the sparsity. On one side 
of this curve PGD updates is successful with high probability on the other side it fails with high 
probability. In future depictions of phase transitions in this section we shall only plot the phase 
transition curve as found by a boundary detection routine in lieu of the complete colormap. 5 

We now turn our attention to comparing the phase transition curves obtained for the sparse 
recovery problem with different cost functions and learning parameters. We shall use fp-norms with 
p= 0,1/2,1 as our cost functions (i.e. f(z ) = ||z||^ o , ||;z||^ 1/2 , and Hz]]^). 6 In our experiments we use 
both binary and Gaussian matrices. 

We first consider the greedy learning parameter /j = ^ and show the empirical phase transition 

5 In particular MATLAB bwtraceboundary function. 

6 We note that the proximal function of £i/ 2 has a closed form solution e.g. see [96]. We have utilized this closed form 
solution of the proximal function to implement projection on to the one-half ball. 
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(a) Phase transition curve with a conservative learning parameter. This diagram shows 
the empirical phase transition curve for ^-projected gradient with /i = 1/ (•y/m+ s/n) for successful 
recovery of a vector x s |R rl=1000 with s nonzero entries from m random measurements of the form 
y - Ax. Here, A e R m>m jg a Gaussian or Bernouli random matrix. We demonstrate results using 
^Ojfu/2, an d ^i- We also draw the prediction of Theorem 2.3 for G-PGD for comparison. We see that 
the prediction of Theorem 2.3 is a near perfect match with the empirical simulations. 

Gaussian matrices Binary matrices 




n n 

(b) Comparison of different learning parameters y for ^-PGD. 

This diagram shows the empirical phase transition curve for fi-projected gradient with three different 
learning parameters: greedy choice y - 1/rn, a conservative choice y = 1/ (^/?7^ + >/n) as well as the 
structure dependent choice of Theorem 2.4. This curves are about recovery of a vector x e [R" =1000 with 
s nonzero entries from m random measurements of the form y = Ax. Here, A e R mxrl jg a Gaussian or 
Bernouli random matrix. We also draw the prediction of Theorem 2.4 for fi-PGD for comparison. We 
see that the prediction of Theorem 2.4 is a near perfect match with the empirical simulations. 
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curves in Figure 2. These figures show that the phase transition curves for Bernoulli and Gaussian 
matrices are essentially identical. This is in line with the universality of phase transitions observed 
previously, e.g. see [32], These plots also suggest that non-convex cost functions may require fewer 
measurements for successful recovery. While the curves are close to each other, this experiment 
shows that fi^-PGD requires the smallest sample complexity and £o~PGD outperforms ^i-PGD. 
This is perhaps surprising as one would expect p = 1/2 to perform somewhere between p = 0 and 
p = 1. We also plot our theoretical prediction for the phase transition for p = 1/m (equation (2.3) 
from Theorem 2.2). We observe that the empirical phase transitions and our theoretical predictions 
are rather close. In particular, our theoretical prediction for p = 1 is off by at most a factor of 1.18. 
Our predictions for p = 0 and p = 1 is not as sharp due to the extra factor Kf = 2 in (2.3). 

We now focus on phase transition curve for the more conservative learning parameter p = 
l/(^/m + \/n) 2 . For PGD with i\ projection, Theorem 2.3 predicts that PGD phase transition 
matches that of l \-minimization (a.k.a. the Donoho-Tanner curve [37,39]). In fact as mentioned 
earlier the statement of Theorem 2.3 holds for any convex function and not just £\. Figure 3a 
confirms this fact empirically and shows that our theoretical prediction is indeed tight. This plot 
also shows the phase transition curves of t?o~PGD and f^ 2 -PGD is below that of -fi-PGD. These 
curves confirm that starting from zero nonconvex PGD updates beat the phase transition of £\ 
minimization (i.e. the Donoho-Tanner curve)! We note that our theory for the conservative tuning 
parameter only applies to convex regularizers. Characterizing t?o-PGD or £ 1( / 2 -PGD curves via the 
conservative choice of tuning parameter p = is left to future research. 

Finally, in Figure 3b, we compare the empirical phase transitions of different choices of tuning 
parameters for PGD. In addition to the greedy (p = 1/m) and conservative (p = 1/ (\/m + ^/n)") 
choices of the learning parameter we also use the structure dependent choice of the learning param¬ 
eter (equation (2.5) of Theorem 2.4). The structure dependent choice of the learning parameter 
aimed to have a linear convergence rate (similar to the greedy choice p = 1/m) but with improved 
sample complexity. Figure 3b shows that this is indeed the case, as the phase transition curve of 
this choice of learning parameter is below that of the greedy choice. In particular our theoretical 
prediction in Theorem 2.4, is a near identical match with the empirical phase transition. The dif¬ 
ference between various choices of the learning parameter p shows that there is a tradeoff between 
convergence rate and sample complexity. The greedy choice of learning parameter leads to a linear 
convergence rate but requires more measurements for successful recovery when compared to the 
phase transition of £\ minimization (note that this phase transition is not drawn in this figure but it 
essentially an identical match with the curve of p = 1 /(\/m + - s /n) 2 ). In comparison the conservative 
choice of learning parameter does lead to exact recovery precisely above the phase transition. 
The convergence rate is no longer linear (it is however still geometric). Theorem 2.5 shows that in 
fact this behavior is sharp and when using PGD updates a linear rate of convergence can not be 
obtained exactly above the phase transition. 


3.1.2 Rates of convergence 

The aim of the simulations of this section is to explore how the rate of convergence of project 
gradient depends on the number of measurements m and the structure complexity of the unknown 
signal mo as well as the learning parameter p. Similar to the simulations of the previous section 
we use random signals x e IR n and random sensing matrices A e IR mxn . The support of the signal is 
of size s and is chosen at random with the values on the support distributed i.i.d. A/"(0,1). We use 
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Figure 3: Rate of convergence of fi-PGD for different learning parameters. 

This diagram shows the empirical rates of convergence for ^-projected gradient with 
three different learning parameters: greedy choice y = 1/m, a conservative choice 
y = 1/ (^m + yfHy as well as the structure dependent choice of Theorem 2.4. These 
curves are about recovery of a vector x e j> n = 2 °oo with s = 40 nonzero entries from 
m = 1200 random measurements of the form y = Ax. Here, A e [R mxn is a Gaussian 
random matrix. Curves show relative error of t\ projected gradient descent for 50 
iterations and show the convergence rate for 100 trials as well as the average over 
these trials in bold. We see that all three step sizes show geometric rate of convergence 
and that the larger step sizes lead to a faster convergence rate. 


y = Ax € [R m as our measurements and apply the PGD update with various learning parameters y 
starting from Zq = 0. In the experiments of this section n = 2000. 

In our experiment we shall focus on sparse recovery using t\ minimization. We run i\ projected 
gradient descent for 50 iterations and record the convergence rate for 100 trials. We depict these 
curves in Figure 3 as well as the average over these trials in bold. We see that all three step sizes 
show geometric rate of convergence and that the larger step sizes lead to a faster convergence rate. 

To investigate the nature of the dependence of the geometric rate of convergence on the number 
of measurements and the minimum required number of measurements (which in turn depends 
on sparsity) we consider different sparsity levels s = O.Oln, 0.02n, 0.05?r, O.ln, 0.2n, and 0.5n. The 
number of measurements used for each sparsity level is set to m = 8 mo, where niQ is obtained from 
the phase transition of t\ (Donoho-Tanner curve). For each sparsity level we run G-PGD with 
learning parameter y - 1/m for 50 iterations and record the relative error. In Figure 4 we show the 

__ T_ 

median relative error from 100 trials normalized by the ratio ((l~ 7 )^f ) 2 f° r different sparsity 
levels. Here, 7 = 0.075 and mo is the value obtained from the empirical phase transition for the 
greedy choice y = 1/m (the curve is depicted in Figure 1)). We see that except for a burn-in period 
across different sparsity levels and many iterations the normalized relative error is constant. This 

shows that the convergence rate scales like \ ■ This is exactly the type of behavior predicted by 

Theorem 2.2. In particular, Theorem 2.2 finds an upper bound on the convergence rate of the form 
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Figure 4: Normalized convergence rates for different sparsity levels. 

This diagram shows the empirical rates of convergence for £i-projected gradient. These 
curves are about recovery of a vector x e [R n = 2000 with s nonzero entries from m random 
measurements of the form y = Ax. Here, A e [R mxn is a Gaussian random matrix and 
different curves denote different values of s. The horizontal axis corresponds to the 
number of iterations. The vertical axis represents the median relative error from 100 

_ T 

trials normalized by the ratio ((l-y )^) 2 f° r different sparsity levels. Here, 7 = 0.075 
and mo is the value obtained from the the empirical phase transition for the greedy 
choice y = 1/m (the curve is depicted in Figure 1)). These curves confirm that the 

convergence rate scales like y — as predicted by Theorem 2.2. 


(—)a where 8 ?no is an upper bound on the empirical phase transition mo- Figure 4 indicates 
that our bounds can be sharpened in practice by replacing the upper bound 8 mo by the empirical 
phase transition rh o- 

3.1.3 More data, less work 

Our results suggests interesting tradeoffs between data and computational resources. To see this 
for the moment let us focus on Gaussian design matrices and convex /. As we mentioned in real 
applications of linear inverse problems we observe noisy samples y = Ax + w and wish to solve 

x = argmin ||y-Az ||| 2 subject to f(z)<f(x). (3.1) 

Let e be the desired relative accuracy of the optimal solution, i.e. e = ||x - *||^ / ||®||^ . It is known 
(e.g. see [72,74,84]) that for Gaussian measurements and convex / 











We can think of e as the “statistical accuracy” of our solution that is the answer we would get if we 
solved the optimization problem (3.1) with arbitrary precession. Now let us compare this answer 
with our results for projected gradient descent which states that using /r f» — after r iterations for 
m > 6.85mo we have 

"T,r lk <f6.85^) f H-^(3 + V5)-- 1 

\ x \z 2 ' 8 (l-^6.85^) m 11*1^ 

Simple manipulations imply that for m > 6.85mo we have 



The latter expression is an upper bound on the “numerical accuracy” we get when using projected 
gradient descent. Since our statistically accuracy is e, it is natural to aim for a numerical precision 
of the same order e.g. lOe. In order to guarantee this numerical accuracy, based on (3.2) the number 
of iterations must obey 

r > 2 

lo e(sik) 

This expression is rather interesting as it shows that the larger m or data complexity is, the smaller 
the number of iterations required to get to a specific numerical accuracy of lOe. To gain insight 
on the computational complexity note that in each iteration we have to apply A and its transpose 
A* once and also perform a projection onto the set /C. Using X to denote the computational 
complexity of an operation, the total computation complexity to arrive at a numerical accuracy of 
e is 


Time budget cxT’(numerical accuracy of e) 

log( —) / \ 

> 2 - - —-—r- ff(apply A) + A(apply A*) + ^(projection on to 1C) . 

kT ' ' > 

As a result we can characterize the required number of measurements as a function of the structural 
complexity (via mo), time budget and the cost of applying A, its transpose and projection onto 
the set /C. Focusing on Gaussian matrices and assuming that the projection is negligible compared 
to applying A and its transpose. We note that for many interesting problems that arise in practice 
this is true. For example, when / is the i\ norm the computational complexity of projection is 
on the order of nlogn which is less costly than applying A and its transpose. Therefore, we can 
deduce that 

computational complexity oc--——— -n. 

This formula shows that as we increase the data complexity m the computational complexity 
decreases up to a certain point (m < 21.75mo). However, if we increase the number of measurements 
further the computational complexity will start increasing again. 

The tradeoff between data and computational resources is even more interesting for SORS en¬ 
sembles specially when applying A and its transpose is independent of the number of measurements 
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m. For example, for subsampled Fourier with random sign or subsampled Hadamard and random 
sign we have 

nlogn 

_m_\ * 

c(logn) 4 ?no ' 


computational complexity oc-- 

log( 


The further we increase the data complexity the smaller the computational complexity: More data 
means less work! 

To demonstrate that this is indeed the correct behavior let us focus on the computational 
complexity when f(z) = ||z||^ and when the structured signal has sparsity s. We run projected 
gradient descent until the relative error is small, i.e. \\z T - x\ t /\x\ t2 < 1CT 3 . We use a sparsity 
level of s = 0.025n and compare the real run time of the algorithm with theoretical predictions 
of the form — j ——y and otp — y for Gaussian and Fourier with random sign ensembles. 


t( m \ 

= \/ 3 G m 0 ) 


log 


m I 

Ppm 0 ) 


The values of a and /3 are chosen to minimize the Euclidean norm of the run time as compared 
to these theoretical predictions. We note that the best choices are f3c = 4.054 and (ip = 3.5. 
Figure shows that our theoretical predictions is a near ideal match with the actual run time of the 
algorithm (This is with the caveat that we can not predict (3g and /3f precisely of course. Our 
precise theoretical prediction for /3q was 8 but as we mentioned before this constant is only an 
upper bound and not precise but interestingly only off by a factor of 1.7). As a result these curves 
confirm our earlier observation that for Gaussian ensembles increasing the number of measurements 
(or data complexity) decreases the runtime as long as m < 10.25mo but further increases in the data 
complexity will lead to an increase in the run time. For the Fourier with random sign ensemble 
however, increasing the data complexity will only lead to further decrease in the run time. More 
data, less work! 


Gaussian matrices SORS matrices 




m 0 m.Q 


Figure 5: Blue curves show actual run time of projected gradient descent in millisec¬ 


onds. Red curves shows the best scaling of the functions 


and 


log ( 4.054m 0 ) ’°g( 3.5m 0 ) 

the Gaussian and sub-sampled Fourier with random sign ensembles, respectively. 


for 
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(a) Eram Garden, Shiraz. (b) Noisy image, PSNR=8.13. (c)Denoised image, PSNR=24.3. 



(d) Fine nerve fibres. (e) Noisy image, PSNR = 8.13. (f)Denoised image, PSNR=26.2. 


Figure 6: Image denoising using CBM3D. 

3.2 Experiments on natural images 

In this section we demonstrate the utility of an image denoiser for image recovery from compressed 
measurements. We refer the reader to [58,59] for similar simulations with the AMP algorithm. For 
this purpose we shall use the CBM3D denoiser of [29]. First let us demonstrate that CBM3D is 
indeed well suited for denoising. To this aim we consider two images and add Gaussian noise to 
them. The first image is of the Eram Garden in the central Iranian city of Shiraz. The second 
image depicts fine nerve fibres highlighted with a fluorescent dye. Figure 6 shows the performance 
of CBMD3 on these two images with added Gaussian noise. The signal to noise ratio in these 
images is reported in terms of PSNR (101og 10 (number of pixels / ||x - *||^ )). These figures clearly 
show that CBMD3 is a good denoiser for images in the presence of Gaussian noise. 

We now turn our attention to using the CBM3D denoiser for recovery of an image from under¬ 
sampled linear measurements. Our measurement process consists of modulating each pixel of the 
image by a random i.i.d. ±1 mask, applying a two dimensional Discrete Fourier Transform (DCT), 
and then picking a random subset of size m of these DCT coefficients. Since each photograph is 
in color, we apply this measurement process to each color band for a total of 3 m measurements. 
We shall use the short-hand A : R 3n -»■ IR 3m to denote this linear measurement process, where A(x) 
takes a color image x e IR 3n (with a total of 3 n pixels) and outputs a vector y = A(x) e IR 3m con¬ 
taining 3 m measurements. We note that this measurement process is an example of SORS matrices 
obtained from Bounded Orthogonal Systems (as discussed in Section 2.4) and as a result our theory 
applies to this measurement process. 

We now wish to recover the original image from such under-sampled measurements. We start 
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Relative error vs. number of measurements. 



under-sampling ratio ((?) 


Figure 7: Performance of CBM3D based proximal gradient descent on two images. 
Figure depicts relative error of the recovered image (||x - x\ t / ||®||^ ) as a function of 
the number of measurements m. 


from z§ = 0 and apply the following proximal gradient updates 


z r+ 1 - S (z T + (y 7l(2 r )), A r ). 


(3.3) 


Here S is the denoising procedure with tunning parameter A r . We shall use the CBM3D denoiser 
for S. We note that this iterative update is directly related to the projected gradient update 
discussed throughout this paper. Indeed, the projected gradient update with the level R set to 
f(x) can be viewed as a denoiser S with A r tuned optimally. In a companion paper [70], we will 
discuss how the theoretical framework of this paper generalizes to the proximal update in (3.3) 
even when A r is not optimally tuned. We shall use A r = Aoy T in our experiments. This choice 
is based on results in [70] which suggest that this is a good tuning strategy. We now apply the 
proximal update with the CBM3D denoiser with Ao = 28.8675 ||£C||^ ^ and 7 = 0.95. We remind 
the reader that the image has 3 n total pixels (n pixels in each color band) we vary the ratio m/n 
in the interval [0,1]. For each value of m we run 200 iterations of the proximal update (3.3) and 


record the relative error 




Me 


(color images are viewed as a large vector). We report the results 

in Figure 7. This figure indicates that the relative error decreases as a function of the number of 
measurements. Furthermore, this value falls below 5% for m > 0.3n and m > 0.4n for the Erarn 
Garden and Fine nerve fibres images, suggesting that 30% and 40% under-sampling may be enough 
to approximately recover the image. Figure 8 shows that indeed the recovered images are good. We 
note that we did not expect projected gradient to recover the images exactly as the de-noiser may 
not be perfect in capturing the structure of a real image. The correct analogy in sparse recovery 
literature is that the image is not “exactly” sparse. Rather it is only approximately sparse. 
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(a) m = 0.3n, PSNR=29.1007. 


(b) to = 0.4n, PSNR=38.2208. 


Figure 8: Recovered images from 3m under-sampled DCT measurements using 
CBM3D proximal gradient descent. Here, 3n is the total number of pixels, n pix¬ 
els for each color band. 


4 Prior Art 

Optimization problems such as (1.1) and in particular Fi-minimization techniques have been emp- 
pirically used to find structured solutions to underdetermined linear systems in multiple scientific 
fields [26,28,47,88]. While there were some theoretical results describing the success of these algo¬ 
rithms in recovering structured solutions, the required number of measurements were sub-optimal. 
The last decade has seen a flurry of activity surrounding these problems in part due to the sample 
optimal results obtained for certain random measurement ensembles in sparse recovery [20,33] and 
rank minimization [19,46,51,79] problems. 

A significant focus in the literature has been on characterizing the precise number of measure¬ 
ments required for recovering the structured signal exactly via (1.1) in the special case where the en¬ 
tries of the measurement matrix are i.i.d. real-valued standard normal variables. For sparse/matrix 
recovery problem the papers [21,33,79,81] established sample optimal results up to numerical con¬ 
stants. For sparse recovery problems . These constants were later made more precise in [15,24]. 
Earlier Donoho [34] and Donoho and Tanner [37-39] showed that the precise number of measure¬ 
ments (as a function of sparsity) required for the success of l\ minimization can be characterized 
by an asymptotic phase transition (a.k.a. the Donoho-Tanner curve). More recently, heuristic ar¬ 
guments from statistical physics where used to explain this phase transition [35]. These heuristics 
were justified in [6,7] for Gaussian matrices and in [5] for matrices with sub-Gaussian entries per¬ 
turbed by small Gaussian noise. Related, in [83] Stojnic established lower bounds on the number 
of measurements that matched the Donoho-Tanner curve in an asymptotic regime. Please also 
see [67,80] for related results for rank minimization via the Nuclear norm. More recently, Chan- 
drasekaran, Recht, Parrilo and Willsky [24] derived the first precise non-asymptotic bounds via 
Gordon’s lemma [45]. It was shown in [3] that these lower bound are asymptotically sharp. Please 
also see [84-86] for results of a similar flavor in special cases. In Theorem 2.3 we have shown that 
projected gradient descent with a properly chosen step size can match these asymptotically sharp 
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results for convex functions, providing an algorithmic proof. Furthermore, our results in Theorem 
2.2 are more generally applicable and also apply to any function including non-convex functions 
albeit with a loss in terms of a small constant. 

For more structured measurement ensembles, such as subsampled Fourier matrices, Candes and 
Tao [20] obtained near optimal sample complexity results for t\ minimization that were sharp upto 
logarithmic factors. These results were further improved in [27,81]. Please also see [18,43,52,78,91, 
92] and references therein for similar results for many other ensembles. Please also see [16,17,19] for 
other structured ensembles arising in rank minimization problems. Moving beyond sparse recovery 
and rank minimization [2,62] establishes near sample optimal results (up to logarithmic factors) for 
decomposable norms. Such norms are convex and include important norms such as l\ and nuclear 
norms. In this paper utilizing our results from a companion paper [71] we have established near 
sample optimal results for SORS matrices that apply to any function convex or non-convex. To 
the extent of our knowledge this is the first sample optimal result using a computational friendly 
matrix that applies to general signal structures and functions. Indeed, we are unaware of any 
sample optimal results even for general convex functions. 

The prior works we have discussed so far are not algorithmic in nature in the sense that often 
the properties of a convex optimization problem have been studied without focus on how such 
problems are actually solved. While in principal such problems can be solved in polynomial time 
via interior point methods for many application first order methods may be more suitable. Bounds 
on convergence rates of first order schemes [64] for general convex functions are known. However, 
since the penalty functions used for solving linear inverse problems are non-smooth, these bounds 
are often very pessimistic and rather far from the actual rates of convergence. For example [8,9,50], 
establish sub-geometric upper bounds on convergence rate for sparse and low-rank estimation prob¬ 
lems via proximal/projected gradient algorithms such as [65]. Related see also interesting works 
by Goldfarb, Osher, Yin and collaborators [25,66,94]. In practice the convergence rate of these 
algorithms are often geometric. Agarwal, Negahban and Wainwright [2] showed that when the cost 
function is a decomposable norm the rate of convergence of projected gradient descent is indeed 
geometric. See also [54] for related results/generalizations. The authors obtained geometric conver¬ 
gence rates in terms of restricted eigenvalues. Specialized to Gaussian measurements these results 
do not come with explicit or sharp constants. In contrast we have developed precise convergence 
rates that applies to any function (convex or non-convex). In the case of Gaussian measurements 
we have also provided sharp constants. More recently, a few papers focus on a penalized formu¬ 
lation of (1.1). For example, Xiao and Zhang [93] obtained geometric rates of convergence based 
on the Restricted Isometry Property. These results were further generalized by Eghbali and Fazel 
to decomposable norms [40]. We defer the study of proximial algorithms for general functions to 
a future publication. Convex functions are not the only way to enforce structure and often it may 
be more suitable to use non-convex functions. For sparse recovery problems the first results on 
geometric (and in fact linear) convergence of such problems are due to Tropp and Gilbert [89] and 
Garg and Khandekar [44] who studied convergence of greedy hard-thresholding algorithms. See 
also related greedy strategies [60, 61] which have similar guarantees. More broadly convergence 
of iterative hard thresholding and its variants such as singular value hard thresholding have been 
studied by multiple authors [4,10,13,41,48,49,55]. These authors show that under RIP or matrix 
RIP conditions such algorithms have a linear convergence rate to the structured solution. Please 
also see related results [42,82] on characterizing the properties of the optimal solution when using 
non-convex t v (with p < 1) minimization problems for recovering sparse signals. We note that 
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the latter results do not ensure that i v minimization is tractable as they do not have convergence 
guarantees. Our general framework covers any function and thus allows for precise analysis of rates 
of convergence such algorithms as well. In particular, for Gaussian ensembles we are able to also 
provide sharp constants. 

A few recent papers study computation-statistical tradeoffs for machine learning problems. For 
example, in [22] Chandrasekaran and Jordan focus on such tradeoffs in the context of denoising 
via hierarchies of convex relaxations. In contrast, our focus is on a fixed cost function and linear 
inverse problems. More specifically, we focus on computation-statistical tradeoffs based on the 
convergence rate of projected gradients. We refer to Agarwal’s dissertation [1] and references 
therein for other examples. Finally, in [11,12] the authors study computation-statistical tradeoffs 
for linear inverse problems. See also the very recent paper characterizing statistical properties of the 
optimal solution [53] based on results in [90] brought to our attention by Volkan Cevher. The focus 
in these papers are not directly on projected gradients. Rather, the authors focus on “lowering the 
computational cost through additional smoothing”. In particular, the rates of convergence used in 
the analysis of these papers are not geometric and are based on more classic results where the error 
is inversely proportional to the square of the number of iterations. 

Finally, we would like to mention related work surrounding the Approximate Message Passing 
(AMP) by Donoho, Montanari and Maleki [36]. See also [76,77] for generalizations of AMP. The 
updates in AMP are similar to a projected gradient update with an additional memory term similar 
in nature to Nesterov’s accelerated scheme [63]. However, motivated by ideas in statistical physics 
this algorithm comes with a very specific recipe for the coefficient of this memory term (a.k.a. the 
Onsager term) [31]. For Gaussian measurement ensembles and separable functions of the form 
/(*) = Y,2= i d( x i) with g a pseudo-Lipschitz function it has been shown that the convergence of this 
algorithm can be described asymptotically via certain state evolution equations [6,7]. An important 
consequence of this analysis is that AMP asymptotically achieves a linear rate of convergence of 
for pseudo-Lipschitz and separable functions exactly above the phase transition of (1.1) 
(m > mo). See also [31, 95] for predictions of performance of the AMP framework for general 
functions using highly plausible but non-rigorous statistical physics arguments. In contrast, our 
results are non-asymptotic, rigorous and apply to any function. However, our theoretical results and 
empirical studies show that projected gradient descent can not achieve a linear rate of convergence 
exactly above the phase transition of (1.1) even when the cost function is separable. We believe 
that our theoretical framework may provide useful insights for precise, non-asymptotic analysis of 
AMP for general functions. 

5 Discussion 

In this paper we have discussed sharp time-data tradeoffs for linear inverse problems. We would 
like to pause to discuss how our results can be further improved. 

• From denoising to compressive sensing. In addition to the connection between the rate 
of convergence and sample complexity, our results also connect de-noising and compressed 
sensing. That is, we establish a precise connection between the capability of a function / 
when used for de-noising and its capability for recovering a structured signal via the PGD 
updates. To be more specific, it is easy to show 
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Theorem 5.1 Consider a structured signal x e IR n and assume we observe a noisy version 
of this signal x + w, where to e IR" is Gaussian noise distributed as Furthermore, 

let Vic with K. - {z e IR n j f(z ) < f(x)}. Then 


sup 

C7>0 


\\'Pic(x + w)-x\\ 2 h 


<y 


< 4 u?(Cf{x) n B") ~ 4mo- 


This theorem combined with the results in this paper shows that the min-max rate of denoising 
is intimately related with the phase transition curve for exact signal recovery when using / 
(up to a small constant less than 4). To the extent of our knowledge this is the first result to 
establish this connection for general /. For convex / please see [22] for analogous results to 
Theorem 5.1 above. We note that the results for convex / are sharper in the sense that it is 
known that the the min-max rate of denoising coincides with the phase transition curve for 
exact recovery. The connection between compressive sensing and denoising was conjectured 
in an asymptotic setting for the Approximate Message Passing (AMP) algorithm [35,36] by 
Donoho, Johnson and Montanari in [31]. Please also see [58] for related discussions and 
numerical simulations. Theorem 5.1 above combined with the results proven in this paper 
provides strong evidence that this conjecture holds non-asymptotically for projected gradients 
using any function (it essentially proves this conjecture holds upto a small multiplicative 
constant). For convex /, prior work [6,7] established this connection first for LASSO and 
later on for general / [3,23,68] with a precise constant of 1. We believe that our analysis may 
shed light on a complete proof of the conjecture of [31] and we consider this a very interesting 
future research direction. 

• Sharper sample complexity for non-Gaussians. Compared to our results for Gaussian 
ensembles our sample complexity results and convergences rates for non-Gaussian ensembles 
(ISG and SOS) are sub-optimal by constant/log factors. In particular we did not have results 
with sharp constants for these ensembles. Providing sharp constants in our sample complexity 
bounds is an interesting direction for future research. In particular, our experiments suggest 
that the sample complexity for SOS matrices matches that of Gaussian ensembles, i.e. the 
phase transition for Gaussian and SOS matrices are essentially identical. 

• Sharper time—data tradeoffs. The numerical simulations in Section 3.1.2 and in particular 

Figure 3 suggest that the rate of convergence roughly scales like \Jrho/m where rho is the 
empirical phase transition and m is the number of measurements. Establishing such a precise 
upper bound on the convergence rate is an interesting future direction as it would lead to 
more precise time-data tradeoffs. Related, the sample complexity of Theorems 2.3 and 2.4 
seem to match the phase transition curves obtained by numerical simulations in Section 3.1.1. 
However, these two theorems where only limited to convex functions. It would be interesting 
to establish a similarly sharp result for non-convex functions. While we did prove results for 
non-convex functions in Theorem 2.2 our numerical results in Figure 2 seem to suggest that 
(a) Kf <2 for non-convex functions and (b) the convergence rate even for convex functions 
can possibly be improved to p(/x) < X 1 \Z^rF w \J 5-8^ (and in turn the requirement on 
sample complexity can be improved to m > ~ 5.8mo). Closing these gaps for the 

non-convex case is particularly important as it would help explain the surprising numerical 
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results in Figure 3b which seems to suggest that Tw 2 works better than both Iq and t\ for 
sparse recovery problems. 

• Extensions to other update strategies and general loss functions. While in this 
paper we have focused on projected gradient updates we would like to point out that our 
results naturally extend to many other update strategies such as proximal gradients, stochastic 
updates etc. These results will be the subject of a companion paper [70]. Also our results can 
be generalized to other loss functions (i.e. non-^ 2 ) in a straightforward manner which will be 
detailed in a future paper. 


6 Proofs 

In this section we shall prove all of the stated theorems. We will first prove the deterministic result 
of Theorem 1.2 and then show in later sections how all other theorems follow by bounding (with 
high probability) the convergence rate p(p) and noise interaction term of this theorem for 

various random ensembles and learning parameters p. 

Throughout we shall use S”^ 1 and B n to denote the unit sphere and Euclidean ball of R n . For 
a set K. € [R n and a point x € IR ri , we use Vtc(x) to denote the projection of x onto the set /C. For a 
set /C and point v e IR n we define 


dist(u, /C) = min ||n-wL . 

ite/C 2 

We pause to remind the reader of the definition of the descent set and tangent cone from Definition 
1.1. The set of descent of the function / at a point x is defined as 

V f (x) = {h- f(x + h) < /(*)}. 

The tangent cone Cf(x ) of / at a point x is the conic hull of the descent set. That is, the smallest 
closed cone Cf(x) obeying Vf(x) c Cf(x). For ease of exposition throughout the proof of this 
theorem we shall use the shorthand V and C to refer to Vf(x) and Cf(x). 

We also remind the reader that p(p) is the convergence rate defined as 

p(p) ■= p(p,AJ,x) = sup u* (/ - pA* A) v, 

ii,,ueCnS n-1 

is the noise amplification factor defined as 

w 

'■= ^(A,f,x,w) = p- sup v*A -——, 
vtCnS ™- 1 ll^lba 

and Kf is a constant equal to one for convex functions and equal to two for non-convex functions. In 
order to prove our results we first state and prove a few results regarding cones in higher dimension. 

6.1 Cone and projection identities 

In this section we will gather a few results regarding higher dimensional cones and projections that 
are used throughout the proofs. We begin with a well known lemma about projections onto convex 
sets. This result is standard and we skip the proof. 
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Lemma 6.1 Assume 1C e IR n is a convex set and v e IR n . Then for every u £ 1C we have 


\\v-V K (v)\\ 2 h + \\u-Vk,(v)\] 2 < \\v-u\\ . (6.1) 

In particular if 0 e 1C, then 

\\v-V K (v)\\l + \\V K (v)\\l<\\v\\l. (6.2) 

We now state a result concerning projection onto cones. 

Lemma 6.2 Let C c IR n be a closed cone and v e IR n . The followings two identities hold 

H\l=\\v-Vc(v)\\l + \\V c (v)\\l, (6.3) 

\\Vc(v)\\ i2 = sup uv. (6.4) 

ueCnB" 

Proof The first identity is known as Moreau’s decomposition and is standard. Let us turn our 
attention to proving (6.4). If Vc(v) = 0, then 0 is the closest point to v. This implies that for all 
t > 0 and all u e C we have 


\\v\\ e2 < \\v - t.u\\ e2 =>■ uv<t\\u\\ e2 . 

Taking the limit t -+ 0, we conclude that for all u e C we have u*v < 0. Also since Vc(v) = 0 e (CnB n ) 

sup uv = 0 = \Vc{v)\ t , 
iteCnB" 

holds which proves (6.4) when Vc(v) = 0. To address the case when Vc(v) * 0 set u = \\-p ^)\\ —■ 
Clearly u*v = ||T , c(' u )||^ 2 and u e C n B n so that 

sup u*v > u*v = \\Vc{v)\n . 

tteCnB" 

Therefore, to prove (6.4) it suffices to show sup ueCn6 n u*v < ||'Pc( , n)||£ 2 - Suppose this is not the case 
and there exists u' such that (u 1 )* v > ||'Pc(' l OI|f,- Set u = u '/ ||u'||^ 2 and note that since u! e C n B n 
we must have v*u > ||'Pc( l, )ll£ 2 - Now choosing w = (v*u)u c C we have \w\^ = v*u > ||'Pc( , n)ll^ 2 - 
The latter together with (6.3) implies 

dist 2 (u,C) = H |\ - \[Pc{v)\] 2 > 

which is in contradiction with w e C. 

The following lemma is straightforward and follows 
tances. 

Lemma 6.3 Suppose 1C c [R n is a closed set. The projection onto K, obeys 

Vk.(x + v)-x = V k _ {x} (v). 

The next lemma compares the length of a projection onto a set to the length of projection onto the 
conic approximation of the set. Our approach is in part inspired by well known results in geometric 
functional analysis e.g. see [75, Lemma 8.2] for related calculations. 


II 2 || || 2 || || 2 

v II/ 2 -HI* 2 = \\ v ~ w h 2 > 

■ 

from the fact that translation preserves dis- 
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Lemma 6.4 (Comparison of projections) Let T> be a closed and nonempty set that contains 
0. Let C be a nonempty and closed cone containing V (D c C). Then for all v e [R n , 


\\'Pv(v)\\ h <2\\Vc(v)\\ e2 (6.5) 

Furthermore, if V is a convex set. Then for all v e [R n , 

\\Vv(v)\\ e2 <\\Vc(v)\\ h . ( 6 . 6 ) 

Proof Let us first show the statement (6.6) for convex V. Since DcC. ||v -Vc(v )\\e 2 ^ \\v-V v (v)\\ 1 2 ’ 
Furthermore, when V is convex using the fact that 0 e V by Lemma 6.1 equation (6.2) we know 
that §i? -Vx>(v )||| + \\Vx>(v) ||| 2 < \v\ 2 £ 2 . The latter two identities together with Lemma 6.2 equation 
(6.3) yields 

\\r c (v)\\l = Ml - \\v-Vc(v)\\l > Ml - \\v-Vv(v)\\l > WVvMWl • 

Now, we turn our attention to proving (6.5) for nonconvex V. By definition of projection onto sets 

Ml > \\v~VvM\l = Ml + W V v( v )\\l -2(v,VvM). 

This yields 

VvM 


\Wv(v)\l <2(v,Vv(v)) = 2(v, IPcWlfc 

'P-p(v) 


\\Vv(v)\\ h <2 [v, 


\WvM\\ h 


Since C is a cone that contains V , ~ has unit length and belongs to C we have 

W'PvMWb <2 Iv, )l<2- sup u*v< 2\\V c (v)\ e 

\ \\l J V{V)\\ h l U eCnB" 

where in the last inequality we have used Lemma 6.2 equation (6.4). This completes the proof of 
(6.5). ■ 

We end this section by proving a lemma that shows that scaling a vector increases its projection 
onto set. 

Lemma 6.5 Let V c IR" - be a nonempty set. For any t2>ti>0 


\\V v {tiw)\\ i2 < \ Vv(t 2 w )\ l2 , 

holds for all w e IR n . 

Proof By definition of projection onto a set we have 

\\Vv (hw) - h w \\l < \\Vv (t 2 w) - tiw\\l and \Vv (tew) - t 2 w\\l < \\Vv (hw) - t 2 w\\l. (6.7) 

Summing these two identities we conclude that for any t 2 > t\ 

{w,Vx>(t2w)) z {wMvihw)), ( 6 . 8 ) 

holds for all w e IR n . Now rearranging the first inequality in (6.7) and using (6.8) we conclude that 
for t 2 > t\ 

\\Vv(t 2 w)\\l - \\Vv(tiw)\\l > 2ti (( w,Vv(t 2 w)) - (w,Vv(hw))) > 0, 
completing the proof. ■ 
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6.2 Proof of the deterministic result (Proof of Theorem 1.2) 

To prove Theorem 1.2 note that if we apply the PGD update 


Zt+i — 'P (z T + y T A (y Az t ) ), 


then the difference between our iterates and the actual solution h T = z T - x is inside the descent 
set V. Thus we have the following chain of inequalities 

IIh r+ i |^ 2 = || z T+ i — x ||= \\Vk(z t + y T A (y — Azy')') — x\\^ 

- \Pk.-{x} ( z t ~ x + UtA (y - Az t ))||^ 

( = \\Vv{z T -x +y r A*{y-Az T ))\\ h 

( b ) 

< K f \\Vc(z T - x + y T A (y - Az T ))\\ h 
= K f || V c (h T + y T A* (w - Ah r ))\\ h 

< Kf sup v*[(I n - p r A*A)h r + y T A*w] 

veCnB n 

< Kf[ sup v* (I n - p r A* A)h r + p, T ■ sup v* A* m] 

vsCnB n veCnB n 

< Kf[p(y T ) ||h r ||^ + y T ■ sup v*A*w ]. (6.9) 

veCr\B n 

In the inequalities above (a) follows from Lemma 6.3 and (b) follows from Lemma 6.4. Now note 
that 


y T - sup v*A*w = p T i sup v*A* W ) ||w|h = ^ T (A) \\w\\ e . 
vcCnS™ \veCnZ3" \\ W H 2 ) 

Plugging the latter into (6.9) we arrive at 

II h-r+ 11| ^2 — ^/p(Mt) || h T || + Kf ’ (A') || w || . 

We now apply this inequality recursively with y T - p which yields 


II h T 


\l2 


<K f p(y) (nfp(y) ||h T _ 2 ||^ 2 + K f ■ ^( A )) + n f ■ £ M ( A ) ||w| 


ii 


^ { K fp(p)Y Wholly + [ { K fp(y)Y 1 + {^fp(y)) 

l-(K f p(y)) 


T -2 


(K f p(p)) + 1 ]«/-^(A) 


W\ 


£2 


^(K f p(y)) T \\h 0 \\ h 


1 - Kfp(p) 


-Kf^(A) 


w 


h ’ 


concluding the proof. 


6.3 Proof of results with Gaussian ensembles (Proof of Theorems 2.2, 2.3, 2.5 
and 2.4) 

Our proofs regarding Gaussian ensembles follow from Theorem 1.2 by controlling the convergence 
rate p(p) and the noise amplifications factor ^(A) with simple algebraic manipulations. The 
argument for controlling the noise amplification factor £ M ( A ) for Theorems 2.2, 2.3 and 2.4 is 
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essentially identical and is stated in Section 6.3.2. The major different in the proof of these theorems 
however is in the way we control the convergence rate p(g). This requires more care and different 
proof strategies are needed for the different choices of the learning parameter p in each theorem. 
We shall explain how we control the convergence rate p(g) for each of these theorems in Sections 
6.3.3, 6.3.4, and 6.3.5. We shall detail the proof of Theorem 2.5 which provides a lower bound on 
the convergence rate in Section 6.3.6. Before we explain these proofs however, we need to state and 
prove a few essential results on Gaussian comparisons and the supremum of Gaussian processes 
which play a crucial role in our proofs. 

6.3.1 Gaussian comparison inequalities 

Lemma 6.6 (Slepian’s inequality) If Xf and Yt are a.s. bounded, Gaussian processes on T 
such that E[X#] = IE[Yj] and E[A” 2 ] = E[Y) 2 ] for all t e T and 

E[(X t -X s ) 2 ]<E [(Y t -Y 8 ) 2 ], 


for all s,teT, then for all real t, 


E[sup X t \ < E[sup Y t ]. (6.10) 

UT ieT 


Furthermore, 


P{UPQ>r/ t ]}<P{U [Y t >Vt]}- (6.11) 

teT UT 

Lemma 6.7 (Gordon’s restricted eigen-values) [45, Theorem A] Let C e P n be a cone and 
let A be and m*n matrix with entries independently drawn from the standard normal distribution 
A/"(0,1). Also define 


bm = n\\9k 2 l 

where g e P m is distributed as JY(0, I m ). Then for all u e C 


\\Au\ 


e 2 


> b m - uj(C n S n_1 ) -i] 


u 




( 6 . 12 ) 


holds with probability at least 1 - e 2 . Furthermore, for all u € C we have 


\\Au\ 


12 


5= bm + oj(C n S n ^) + p 


u 


l 2 


(6.13) 


holds with probability at least 1 - e 2 . 

We shall now state a generalization of Gordon’s escape through the mesh lemma. The proof of this 
lemma is deferred to Appendix A. 
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Lemma 6.8 Let T c IR Tl and define 

bm = mg\\t 2 l 

where g e IR m is distributed as J\f (0, I m ). Furthermore, define 


Then 


<t(T) := max ||u|L 


\Au\z 2 - b m \\u\\ h | < u(T) + r/, 


holds for all u ef with probability at least 


1 - 4e 8ct2 ( t ) . 

Finally, we shall make use of the following lemma proven in Appendix B. 

p/ t +1 \ 

Lemma 6.9 Define fi(t) = \pi ^ 'j & \Jt. Then for 0 < mo <m we have 

1 '■ 2 ' 

fijrno) <t>{m) 


m 


6.3.2 Controlling the noise amplification factor £ M (A) 

In this section we shall show how to bound the noise amplification factor. Note that for a fixed 
vector w, A* has the same distribution as a random Gaussian vector g e IR n with i.i.d. A/"(0,1) 
entries. Therefore, 


UA) 


T 


sup g v. 

vsCnB™ 


Applying standard results on concentration of supremum of Gaussian processes 

sup g*v<«j(CnS n - 1 ) + T J < y /mS, 


holds with probability at least 1 - e 2 . 

6.3.3 Controlling the convergence rate p(p) with /t« A (Proof of Theorem 2.2) 

In this section we wish to bound the convergence rate p(p) as stated in Theorem 2.2. More 
specihcally, let mo = cj)~ 1 (u+r]) with c f>(t) = \/2 A « \/t be the minimum number of measurements 

V2 ' 

required by the phase transition curve (please see Definition 2.1 for a reminder). Then we will show 
that as long as 


0 

m > 8K)-mo, 


(6.14) 
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1 


rj I 

there is an event of probability at least 1 - 8e s , such that on this event for g = A- := —— 

p(n) < \/s—- (6-15) 

V m 


To this aim define the sets 


71 = C n S n_1 - C n S n_1 and T+ = C n S n_1 + C n S" -1 , 


and note that 


sup ||u||^ < 2 and sup ||v||^ < 2. 

2 1)677 

Furthermore, 

w(71) = E[ sup g 1 (u - «)] < E[ sup g 1 u + sup g T v] = 2 oj(C n S"^ 1 ), 

u,vsCnB n neCnB” De-CoS ' 1-1 

and similarly, cj(71) < 2 u(C n S n_1 ). For ease of presentation we shall use the short hand oj ■= 
uj(C n S n_1 ). Thus, applying Lemma 6.8 on 71 and 7+ we conclude that for all u,v € C n S n_1 

ll- 4 («-'Ollf 2 1 (\\ u ~ v \\i 2 + 2 ^ + ^ j (6.16) 

and 

ll A ( w + ^)llt 1 | m ax{ ||u + «||f 2 -2 ^ b +7? ^ ,0}j (6.17) 

_r? 

hold with probability at least 1 - 8e s . 

If || tt + v H*2 then using equations (6.16) and (6.17) we can conclude that for all u,v € 

C n S n_1 


Vr 7l*A 1 

M (I -— )vs i 

u m ^ 

_ i 

“ 4 


(u + vy(I-^-^)(u + v)-(u-vY(I-^-^)(u-v) 

I'm '-'m 


\ u + v \\h ~ jr !l A (^ + ,,; )lll - l u -«ll < 2 + T |A(u-t >'" 2 


<?2 


(w + 7/)/|| II || || x 

* —7-(Il« + V|< 2 + Il«-V||< 2 ) 


2\/2 


(w + 0 


(6.18) 


|| w + u|b 2 + || li 


u||^ < 2^/2 which follows from the fact that ||w + v\\^ + ||tt - u|| 2 o = 4. 


^2 


11*2 
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then utilizing equations (6.16) and (6.17) again we can conclude that for 


If \\u + vL < 2 
" life b m ’ 


all u, v e C n S 


n—1 


A* A 1 

_ i 

“ 4 


(u + vy(i-^-^)(u + v)-(u-vy(i-^-^)(u-v) 

Um Urn 

W u + V \\h ~ p- ll A ( w + ' ,; )llfe - ll^-HIfe + ll A ( w -^)ll 2 


(“) i 
< - 
4 


b 2 

'-'m 


fe 


2 2 1 
\\u + v\\ h - \\u-v \\ £2 + — \\A(u - u)|| 


( b ) (w + ??) 2 1 / 1 2 II 1,2 

^ ^- + 7 I vT II ^■(' u “ v ) life _ II u ~ Hlfe 


6 2 


4 \ 6 2 


bl bm " "* 2 

„ 0 (^ + f /) 2 , n (u + T]) 

- b 2 m b m 

<| 2V2<^4, 


(6.19) 


where (a) follows from equations (6.16) and (6.17), (b) from the fact that ||ti + «||^ 2 < 2 anc [ 
(c) holds as long as b m > ‘l\[2(uo + r/). 

Now combining (6.18) and (6.18) for ^ = 1/6^ we have 


1 \ b m ) b m 

as long as \/8 ^ +? ^ < 1. The proof of (6.24) and the theorem is complete by noting that 


p(p) = sup 

u,vtCnS n ~ 


oo + r] 


which follows from Lemma 6.9 since by definition b m = cf>(m ) and oo + r] = ^(rao). 

6.3.4 Controlling the convergence rate p(/r) with /r « (Proof of Theorem 2.3) 

In this section we wish to bound the convergence rate p(/u) for convex functions / as stated in 
Theorem 2.3. More specifically, let mo = (f>~ l (oo + rj) with = y/2 L ~ \/t be the minimum 

^ v 2 / 

number of measurements required by the phase transition curve (please see Definition 2.1 for a 
reminder). Then we will show that as long as 


m > mo, 
2 


( 6 . 20 ) 


there is an event of probability at least 1 - e tr - e 7 ”, such that on this event for /r = n "' J 

0.3 


p{p) ^1 - 


m + n 


m - wm o 


m+^/n) 

( 6 . 21 ) 
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To this aim first note that by standard results on concentration of spectral norm of random matrices 7 
for all u e IR n 


n 2 100 , ,—\ 2 

\ Au l 2 ^ ^-(v^ + v”) 


u 


2 

h ’ 


holds with probability at least 1 - e 7Tl with 7 a fixed numerical constant. The latter statement 
is equivalent to the matrix I - fiA* A being Positive Semi-Definite (PSD) for p = -— 0,99 ■> . Now 

applying the generalized Cauchy Schwarz inequality for the PSD matrix I - pA* A we have 

p(p) = sup u*(I-pA*A)v 

•UjtieCnS 71-1 

< sup \J ( u*(I - pA*A)u) ■ (v*(I - pA*A)v) 

•UjUeCnS 71-1 

< sup u*(I-pA*A)u 

tteCnS "- 1 

<6 ' 22) 


Furthermore, by Lemma 6.7 


— ( inf II AulL ) ^ 

L VueCnS 71-1 2 / 


> 1 


(u + ri) 


holds with probability at least 1 - e 2 . Plugging the latter into (6.22) and using Lemma 6.9 we 
conclude that for p = °' 99 


(,/m+Cn) 



= 1 - 


0.99 bt 


-y/moY 


= 1 - 


<1 


m (\/m + \friy 

m 2 (m + n) 

m + n m +1 ^ 

m + n 


(y/m-y/rr^Y 


The last inequality follows by using the fact that for m > 2, m/(m + 1) > 2/3 together with 


bm. — 


\fm +T 


and 2 (m + n) > (\fm + \fn) 


' This also follows from Lemma 6.7 by using C = R n and 77 = 0.005\/n. 
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6.3.5 Controlling the convergence rate p{p) with a structure dependent choice of p 
(Proof of Theorem 2.4) 

In this section we wish to bound the convergence rate p(p) for convex functions as stated in 

p( t +1 ^ 

Theorem 2.4. More specifically, let mo = <^> _ 1 (u; + p) with = \/2 , ~ y/t be the minimum 

number of measurements required by the phase transition curve (please see Definition 2.1 for a 
reminder). Then we will show that as long as 


m > 4mo, 

2 2 
_ ZL_ _ 2L_ 

there is an event of probability at least 1 - e 2 - 4e 8 ? such that on this event for 

_ 3 

m 2 - \j2mo y/m ■ (7 m - y^) 2 


(6.23) 




we have 


where 


(m 0 - 2 v /mmo + 2m) 


p{p) < — 

m 


(6.24) 


^( 7 ) = 2 


(^( l - yy ) 1 - 5 -^) 2 


(7-277 + 2) 

Throughout this section we shall use the shorthand M.^ = I - pA* A. First note that we have 
uM^v = - (u + v)* Mfj, (u + v) - - (u - v)* M^ (u - v ) 

Also note that u + vzC. Therefore, by Gordon’s lemma (( 6 . 12 ) in Lemma 6.7) with probability at 

_r? 

least 1 - e 2 


Now define 


Note that 


- (u + v)* M M (u + v) < 


\\u + v\ 


h 


(l - p (b m - LU - p) 2 ) . 


T = j(it - v) : u, v e C n S n 1 j. 


<y{T) = max ||iu 


eT 


IA 


max 

ttjDeCnS" -1 


||tt - v\\ g < 2 . 


Applying Lemma 6.8 for w e T with probability at least 1 - 2e s have 


Aw \\l 2 < (\\w\\h bm + u(T) + 2p)- 




\\A(u-v)\\ 2 h < (\\u-v\\ i2 b m + u:(T)+ 2p) 2 . 
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Note that 


u(T) = E[sup g*w] = E sup g* (u - v) < E[ sup g*u] + E[ sup g*v] = 2u(C n S n ). 

wtT veCnS "- 1 J iteCnS " -1 ueCnS "” 1 


Thus, 


\\ A ( U - V )fe 2 ^ ( b m\\u-v\\ h + 2(u + g)f, 


holds with probability at least 1 - 4e s . Therefore, 


(u - vfM^u -v)> ||w -v\\g 2 - g(\\u - v\\ h b m + 2(u + iff 

= (l - A) II w -vf h - 4 fib m (u + g) ||u - v\\ e2 - 4 g(uj + gf 


ii ii 2 II II 2 

As a result using the fact that || u + v |L + || u - v |L =4 


u* M^v < 


\\u + v\ 


t-1 


(l- g(bm -uj-g) 2 ) - -(1 -gb 2 m ) \\u - v\\] 2 + gb m {u + g) ||u - t >|| <2 + n(u + gf 


4 -\\ U ~ V \\l .. 1 


4 -(l - g(bm - u - gf) - -(1 - gb 2 m ) ||u - v\\ + gb m (u + g) ||u - v\\ h + g(uj + gf 


\u-v\ 


ii 


(2 - g (b m - uj - gf - gb 2 m ) + gb m (u + g) ||u - v\\ i2 + g(u + gf + (l - g (b m - lo 


■ a 


\u-v\\e 2 + P\\u-v\\ h +7 


/3 2 

<——v 7 
4 a 


g 2 b 2 m (u + gy 


(2 - g(b m -uj-gf - gb^) 
g 2 b 2 m (u> + gf 


g(uj + gf + (l - g (b m - <x> - gf) 


+ g[(uj + gf - ( b m -uj- gf] + 1 . 


(2 - g[(b m - u - gf + tf n ]) 

We note that there is an additional constraint which we need to impose here which is 


P „ gb m (u + g) 

- < ^ O - - 7T 

2 a /n tl - - a2 


(2-ii(bm-u>-gf-frifa) 


< 1 . 


Using v := we have 


* g 2 bf n u 2 

U MnV < - -- 

M 2 -^( 1 / 2 - 21 /+ 2 ) 


+ fib m ( 2 z/ 1 ) + 1 . z/), 


which holds as long as 




2 - gbl, (z / 2 - 2v + 2 ) 


< 1 gb 2 n < 


(v 2 -is + 2 )' 


(6.25) 


?/) 2 ) 


38 



Now note that 


+ 2x. 


d 0 (2 - x (is 2 - 2u + 2)) + xv(v - 1) 

dv (2 - x (is 2 -2u + 2)) 2 

The latter is positive when x > 0, 0 < is < 1, and x < ->_ 2 . This in turn implies that g(fsb 2 n , is) in 

is. Now define s := \f^ and note that by Lemma 6.9, is := < \f^ ■= s. Since g is increasing in 

its second argument we have g(fib^ n ,is) < g(ixb 2 n ,s). Using this in (6.26) we arrive at 

uM^v < g(nb 2 m , s ), (6.26) 

which holds as long as 


A.< 


(s 2 - s + 2) 

We again remind the reader that s was defined as s ■= w —. Further, note that 


d . . ( s 2 - 2s + 2) 

-^-g(x,s) = 
ox 


s 2 x 2 


2 s 2 x 


(2 - (s 2 - 2s + 2) x) 2 2 - (s 2 - 2s + 2) 


x 


+ 2s - 1, 


and 


Thus 


d 2 . . 8s 2 

—— q(x,S) =- 5 -. 

dx 2 ^ ((s 2 - 2s + 2) a; - 2) 3 


d , x 
—g(x,s) = 0 


x = 


2±\/2s(l-s) 2 
(s 2 - 2s + 2) 


3 3 

Also note that at x = 2+ ^!. ( 2 1 s ~^ ^ » ^d( x , s ) < 0 and at x = " > ^d( x ,s) > 0. Thus 

g(/ib 2 - l ,s) as a function of s, is minimized at either of the following two points 


)2 2 - %/2s(l - sY 

^ = (s 2 -25 + 2) 


or nb; n = 


s 2 - s + 2 


The value of g([ib 2 n ,s) at these two points is given by 


9i(s) =9\ 


'2->/2s(l-s)~§ 
(s 2 - 2s + 2) 


s = 1-2 


( 72(1 -s) 15 -sf 
(s 2 - 2s + 2) 


2 ’ 


02 OO =g( 2 2 + 9 ’ g ) = 4^ 

V s z - s + 2 / s^ - 


s(s + 5) 


5 + 2 


_ _ 3 

It is easy to check that for 0 < s < 1 we have g\ (s) < 52 ( 5 )- We note that 2 ^ 


non-negative for s < 0.5. Now using 

m 2 - 72m 0 7m- (v/m-^/mo) 


/* = 


(mo - 2^/mmo + 2m) 


l2 2-+2s(l-s) 
= 


(s 2 - 2s + 2) ’ 
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in (6.26) we have 


uM^v <1-2 


(^(l-s) 1 - 5 -^ 

(s 2 -2s + 2) 2 


1-2 


(\/2(l- , /m °) 1 - 5 /mo' 
V v ^_V m ' _V m ) 

(mu _ 2 /Ha + 2) 2 

\ m m / 


m 


Finally, note that for m > 4mo we have ip(—) < 1 and p > 0, concluding the proof. 


6.3.6 Lower bounds on convergence rate (Proof of Theorem 2.5) 

To obtain lower bounds on the convergence rate, we make use of [69, Lemma F.l] stated below, 
which relates the projection onto a set and the projection onto its conic approximation around the 
origin. 

Lemma 6.10 Let V e IR n be a closed and convex set containing 0 and let C =cone(V ) be its conic 
hull. Given any 0 < a, e < 1, there exists a positive constant 5 '■= 5(a,e,V ) such that, for all vectors 
w obeying 


\\'Pc(w)\\ e2 >a\\w\\ h , (6.27) 

then 

\Wv(w )\\ £2 ^ 

\Wdw)\\ l2 - 

holds for all ||io||^ < 5. 

This lemma suggests that for any set containing the origin, around a sufficiently small neighborhood 
of 0, the conic hull is an arbitrarily good approximation of the original set. We shall apply this 
lemma with T> and C equal to the descent set and tangent cone of / at x. We remind the reader 
that from the third line of (6.9) we know that the error h T = z T -x in our iterates obeys the update 

h T+1 = Vv((I-fiA*A)hr). (6.28) 

Now note that we have 

\\(I - fiA*A)hr\\ h < \\I-fj,A*A\\ I hr\\ t2 < (l + M0-max( A )) \\ h 4e 2 ■ (6.29) 

Applying Lemma 6.7 equation (6.13) together with Lemma 6.9 and (6.29) we also have 


\\V C ((I ~/aA*A)h T )\\ £2 = sup v* (/- pA*A) h T 

vtCnS ™- 1 

1 -h*(I- pA*A)h T 


I h T 


Hi 


I h 




1-p- 


I Ahr 


b 2 


I h T 


h 2 


> \\h T \\ h (l -/i(6 m + uj + t]) 2 ) 


- II II l 2 


(l -/i(\/m + \/ra()) 2 ) 

1 - p,(drn + d™oY 




/ - pA* A)h 7 


■ 


(6.30) 
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As a result based on (6.29) and (6.30) the requirement (6.27) of Lemma 6.10 is satisfied for the 
vector w = (7 - fiA* A)h r with a = ( /'(v^+n/^o) j Thus by Lemma 6.10 there exists a constant 

5 such that for all ||(7 - fiA* A)h T \\ i2 < 5 

\\Vv ((7 - »A*A)hr)l t2 > (1 - e T ) |Vc ((7 - »A*A)h r )||, a 

-n 2 x 

holds with probability at least 1 - e 2 . Thus using r = 1+M0 .a ^ we can conclude that for all 

IIM <a 

\\h T+1 1|, 2 = \Vv ((/ ~ ^A*A)h T ) ||, 2 >(1 - e) I Vc ((7 - »A*A)h T ) ||, 2 

>(1 - e) (l - /r (\/m + v / mo) J j ||h r ||, 2 , 

holds with probability at least 1-e 2 . By inductively applying the latter inequality for all starting 
points ||^o||, 2 < r we will show that for all t > 0 

INI, 2 > (1 - ef (1 - n (Vm + IIM, 2 ■ 

Suppose the proposed equation holds for t = 1,2,..., r. Define 

h T = (1 - eY (l - fi ||fo 0 ||, 2 

Using I h T ||^ < || h T ||, 2 , and applying Lemma 6.5 for t = r + 1 we have that 

II hr +1 1|, 2 = W'P'D ((7 — fiA* A^)h T ) ||, 2 , 

>\\Vv{{I-^A*A)h T )\\^ 

> (1 - e) (l - yU (v/m + \\hr\\ h , 

>(l-e) T+1 (l -^i(\/m + \/mo) 2 ) ||M, 2 ; 

concluding the proof for all t by induction. 

6.4 Sensitivity to the tuning parameter (Proof of Theorem 2.6) 

In all of our proofs so far we have assumed that the parameter R is tuned perfectly and is set to 
R = f(x). This is of course problematic because we do not know the signal x, and therefore also 
don not know f(x ) in advance. Crucial to our proofs so far has been the fact that different between 
our iterates z T and the unknown signal x lie in the descent set V. However, this is no longer true 

when R + f(x). As a result it is important to ensure that our results are robust to perturbations 

of the feasible set V. We will next illustrate how one can get bounds that are robust to the change 
of V. 

• To model R > f(x), define T>f up = {w\f(x + w) < R}. 

• To model R < /(*), define T>f ub = {w\f(x + w) < R}. 
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Similarly, we define IC^ up = {z\f(z) < R} and K.f ub = {z\f(z) < R} for R > f(x) and R < f(x) and 
note that Vf up = Kf up - {x} and Vf ub = ICf ub ~{x). 

Before we begin our arguments we state a lemma which we utilize throughout the proof of this 
theorem. The proof of this theorem is essentially identical to bounding the convergence rate in the 
proof of Theorem 2.2 in Section 6.3.3. We skip the details. 

Lemma 6.11 Let x e IR n be an arbitrary vector in IR n and f ■ IR n -+ IR be a proper function. Suppose 
A e [R mxn is a Gaussian map. Then 


sup - u* (7 - yA* A) v < -»/ 8 — 
■u,t!eCj-(a:)nS n-1 ’ ™ 


holds with probability at least 1 - 4e s . 

As a consequence of this lemma the upper bound on sup - u* (7 - yA* A) v is the same 

u, v cCf ( x ) n S ™ -1 

as the upper bound on p(/u). So we shall assume throughout the proof of this theorem that 


sup - u* (/ - fiA* A) v < p(/i). 
u , v eC f ( x ) n S n_ 1 


6.4.1 Under estimating the tunning parameter (R < f(x )) 

We first consider the case where R < f(x). In this case the difference between our iterates and 
the signal is in V^ ub , i.e. ( z T - x) e T > f ub . It is important to note that in this case 0 ^ T)^ ub since 
f(x) > R. To handle this case it suffices to develop an analogue of Theorem 1.2, the rest of the 
proof follows along similar lines to the proof of Theorem 2.2. Please see Section 6.3.3 for further 
details. 

Lemma 6.12 Suppose, /(•) is a proper convex function and y = Ax. Define K.f ub = {z\f(z) < R}. 
Consider the iterations 


z t+1 = V 1c r (z t + yA* (y - Az T )) 

sub 

The error vector z T - x satisfies the following recursion 

Nr+i - x\\ e < k f p(fi) \\z T - x\\ t + (3 -re/) a; - V k r (x) 
Proof First note that by Lemma 6.3 we have 

z T+ i - x =V k r t x x(z T -x + yA*(y - Az r )) 

sub 1 J 

, (i 1 - yA*A)(z T -x)). 

sub 

Now defining h T = z T - x we arrive at 

h T+ i=V V R (h T -yA*Ah T ). 

sub 




(6.31) 
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We remind the reader that we use C ■= Cf(x) to be the tangent cone of / at the point x. In 
our calculations below it is convenient to use the short-hand notations h T := h T - fiA*Ah r and 
w := V v r (0) = x - V]cr ( x ). With this notation (6.31) can be rewritten as 

h T +\ = 'P'pR (^r)* 

sub 

We consider two cases. In the first case we assume the function / is convex and Kf = 1. In the 
second case we do not assume convexity of / and we prove the inequality with Kf = 2. 


Proof for convex / 

When the function / is convex which immediately implies that the sets K.f ub and T>^ ub are 
convex. Noting that T>f ub c C we have 


\\h T -V c (h T )\ < h r -V V R (h T ) 

"*■2 sub 

Also by Lemma 6.1 equation (6.1) 


(2 


(6.32) 


w - V v r ( h T ) + V v r ( h T ) - h 


12 


< \\w - h T \\, 

i 2 11 11 & 


Furthermore, using the fact that w ■= x - V^r ( x ) € C we have 

II™ _ Ml * Ml + \\w\] 2 ~ 2 W*h T 

<||/i T ||^ +||tu||^-2 w* (I -/jlA* A) h T 


— II111 + H™ll| +2 \\ W \\l 2 IIM* a 


sup 

•u.uECnS 71-1 


- u (I - fiA* A) v 


^ IMfo + Mlf 2 + 2 p(p) Wl< 2 \\ h r\\i 2 


Finally, we also note that 

WVcCh T )l 2 = \\V c ((I-iiA*A)h T )\\ e2 

u* (I - fiA* A) v\ ■ \\h T \\ h 


< I sup 

ytt,t)6CnS n_1 

^p(p) \\hr\\t 2 ■ 


Using (6.33) and (6.34) we have 


w - V v r ( h T ) + V v r ( h T ) - h 


h 


f2 ^ \\ h r\\ h + HI 1 2 + 2 P(p) M 


l 2 II h r II £ 2 


(6.33) 


(6.34) 


(6.35) 


Combining this with (6.32), applying Lemma 6.2 equation (6.3), and using (6.35) we conclude that 


w - e 2 + I v c(h T ) - hr || & < \\h T \\ i2 + \\w\] 2 + 2 p(n) \\w\\^ ||h 


U 2 


= W'Pc(hr) - hr\\ h + \\Vc {hr)l i2 + Hl| + 2p(fi) \\w\\ h \\h T \\ h 
< I Vc(h T ) - hr 111 + (p(p)) 2 \\hr\\ 2 h + \\w\\j 2 + 2 p(fi) \\w\\ t2 \\h r \\ l2 . 


43 


























Simplifying the above expression we conclude that 


W ~ Vv sub to - P ^ 11^11^2 + HI*, ■ 


Finally, applying the triangular inequality we arrive at 


life 


r+llb 2 


Vv R X hT ) , - w -^v R X h r) , +\\w\\ h <p(^)\\h T \\ e +2 \\w\\ £ 

sub £2 su ° -t-2 z z z 


concluding the proof for convex functions /. 

Proof for general / 

Note that since w e , by definition of projection onto V^ ub we have 


h T - V d r ( h T ) 


< \\h T - w\\ 
h 11 "h2 


V d r (fe T ) <2 h*(v V R (fe r )) - 2fe*io + ||iw| 

^ su b II \ ^sub ) 


Noting that the points V v r ( h T ) and w belong to Cf(x ) we have 


<2H* t (v v r (h T )) - 2h*w + \\w\\ 2 e 

sub \ sub / z 


<2h* (/ - fiA* A) V v r (hr) - 2 h* (/ - fiA* A) w + 


w 


<2p(fi) \\h T \\ l2 V V R b (h T ) ^ + 2 p(fi) \\h r \\ i2 \\w\\ e2 + \\w\\ e2 


Completing the square we arrive at 

(l^(^)|L 2 -^) * (p(m) IM& + Hl^f , 


which implies that 


2 

^2 ' 


'PV R X h r ) , ^ 2p(/i) ||fe r || £2 + Hlfc- 

S160 ^2 ^ ^ 


6.4.2 Over estimating the tunning parameter (R > f(x )) 

We now consider the case where R > f(x). In this case the difference between our iterates and 
the signal is in D^ up , i.e. (,z T - x) e T>f up . It is important to note that in this case 0 e T>f up since 
f(x) < R. To handle this case it suffices to develop an analogue of Theorem 1.2, the rest of the 
proof follows along similar lines to the proof of Theorem 2.2. Please see Section 6.3.3 for further 
details. 

Lemma 6.13 Suppose, /(•) is a homogeneous function and y = Ax. Define 1Cf up = {z\f(z) < R}. 
Consider the iterations 


z t+i = V K R p (z T + yA* (y - Az r )) 
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The error vector z T - x satisfies the following recursion 


Z T+ 1 - X\\ £ < Kfp(fl) \\Z T ~ X\\ e +2Kf(l + p{fl)) 


R 

/(*) 


- 1 II a: I 


(■2 ’ 


where the convergence rate p(p) is defined as the restricted singular value of the descent cone of f 

at tt-^-x. That is, 

/(*) ? 


p(p) = sup u*(I-pA*A)v. 

y*)nS n 1 


Furthermore, if f is a norm then p(p) = p(p). 


Proof Define * = jppjX and note that since / is a homogenous function /(*) = j^jf(x) = R. 
Consequently, we may imagine that we are trying to estimate * where w = Ax - Ax is the additive 
noise on our measurements Ax. Applying (6.9) in the proof of Theorem 1.2 in Section 6.2 with C 
set to Cf(x ) we can conclude that 

||z T+ l - X\i <Kfp(p) \\z T ~ X | ^ + Kf ■ p ■ sup u* A*w 

uzCf(x) nS" -1 


=Kfp(p)\\z T -x\\ e +Kfp- sup u*A*A{x-x) 

ucCf(x) nS n_1 


<Kfp(p)\\z T -x\\ e +Kf- sup U*(x~x) + Kf ■ sup 

ii6C/(i)nS n ^ 1 ueC/(a;)nS 71_1 

= Kfp(p) \\Z T ~ x\ t + Kf ||a: - x\i 2 + Kf ■ sup - u* (I - pA* A) (x - x) 

ueCf(x) nS n_1 


u* (I - pA* A) (* - *) 


<Kfp(p) || Z T ~ x\\f 2 + Kf\\x ~ X\i + Kf 

=K f p(p) || Z T ~ X\\ C2 + Kf{ 1 + p{p)) || X ~ X\\ e2 
<K f p(p) (| Zr- X\\ l2 + ||* - *|| /2 ) + Kf( 1 + p(p)) 

=K f p(p) \\Z T - x\ t + Kf f - 1 } (1 + 2 p(p)) \\x\\ £2 . 


sup - u* (I - pA* A) v \\\x - x\\ £ 
u,vzCf( d;)nS n_1 J 


X - x\ 


12 


/(*) 


The proof is complete by using the triangular inequality 


z T +i ~ x\\ e < \\z r+ i - x\\ e + ||* - x\\ h < Kfp(p) | z T - x\\ h + 


R 

/(*) 


-l](Kf + l + 2K f p(p)) 


* 


t-2 ■ 


Finally, we note that when / is a norm C/(*) = C/(*) so that p(p) = p(p). 


6.5 Proof of Theorem 2.8 (ISG ensembles) 

When the measurement matrix is sub-Gaussian (recall Definition 2.7), we have equivalent results 
with possibly larger constants that depend on the sub-Gaussian parameter. This directly follows 
from the results of Dirksen [30]. For earlier results with a similar flavor, the reader is referred to 
Mendelson et al. [57] as well as [56]. In particular, we use the following restatement of Theorem 
4.18 of [30]. 
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Lemma 6.14 Suppose A is a matrix with isotropic and i.i.d. sub-Gaussian rows with parameter 
A. For a set T £ rB n ~ l 


sup 

•ueT 



A* A 


m 


)v 


< Ca 


r(uj(T) + rrf) 
y/rn 


(6.36) 


holds with probability at least 1 - e 7,7 . Here, Ca only depending on the sub-Gaussian parameter 
A and 7 is a fixed numerical constant. 


Proof We briefly describe how this lemma is a consequence of results in [30]. This lemma follows 
from taking the square root of line (14) of [30], completing the squares and using the fact that 
Talagrand’s 72 functional is equal to Gaussian width up to a multiplicative absolute constant 
(e.g. see [87]). ■ 

Applying this Lemma with different sets T of the form C n S n_1 - C n S n-1 ,C n S n_1 + C n S n_1 we 
can deduce that, for T e {C n S"" 1 -C n S^.Cn S”" 1 + C n S" -1 } 


sup 

i>eT 



A* A 


m 


)v 


< 4C*a 


(oj (C n S n_1 ) + 77 ) 

y/rn 


hold with high probability for a constant c(A) that depends only on the sub-Gaussian constant A. 
Observe that for these sets r = 2. In summary, for our purposes a sub-Gaussian matrix behaves like 
a Gaussian up to a constant that depends only on the sub-Gaussian parameter. This is good enough 
to ensure that all of our results can be extended to sub-Gaussians with essentially no modification 
in our analysis. In particular, combining the arguments of Section 6.3.3 with lemma above we 
arrive at the following bound on the convergence rate 

/?(—)= sup u* (I - — 4 Ca 

m u, V eC v ml 



6.6 Proof of Theorem 2.10 (SORS ensembles) 

Our proof strategy for this theorem is exactly the same as the proof of Theorem 2.8 on ISG 
ensembles described in Section 6.5. However, we need a variant of Lemma 6.14 for SOS ensembles. 
Below we state this lemma which is proven in our companion paper [71]. 

Lemma 6.15 Suppose A e [R mxn i s selected from the SORS distribution of Definition 2.9. For a 
set T £ rB n ~ l 


sup 

ueT 



A* A. 

- )v 


m 



holds with probability at least 1 - e 2r? as long as 

m > C'A 2 (rj + l) 2 (logn ) 4 max ^ 




Here A is the bound in (2.14) and C,C' are fixed numerical constants. 
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A Proof of Gordon type lemma (Lemma 6.8) 


Our proof is related to the proof of Gordon’s celebrated escape through the mesh [45, Theorem A] stated 
in Lemma 6.7. We will first show the bound |Att|k < b m ||« ||^ + w(T) + r/. For u e T and v e S m 1 = {i> e 
[R m ; I v | ^ = 1}, we define three Gaussian processes 


X u v = v*Au, Y u>v = \\u\\ i2 a*v + g*u and Z u>v = \u\ t2 (a*v - b m ) + g*u. 

Here a e IR m is distributed as A/XO, I m ) and g s R n is distributed as J\f( 0, /„). 

It follows that for all it, it' € T and v,v' e S m_1 , we have 

E| Y u , v - Y u ,, v ,\ 2 - E|X U) „ - A' u /,d'| 2 = HI fe + Hllfe - 2||ti|fe Hllfe (v,v') - 2(u,u') (1 - (v,v')) 

>Hlfe + Hllfe - 2||u|| f2 Hllfe (v,v ') - 2\\u\\ i2 Hllfe (1 - {v,v')) 

> 0 , 

with equality if u = u' and v - v'. 

We note that by standard concentration of measure for Gaussian random variables 


p { 1 °life ^ mnife] +v] < e , 


We also have 


(a : ||u|lfe ||a|| j g 2 > ||tt|lfe b m + ??) c(a : 


■{ 


IHIfe ll a llfe ^ IHIfe bm + IHIfe } 

V 


—\ a ■ I cl II A > b nlj + 


<T)- 


A - ’ ^(7-) 


} 


={a: ||a|| £2 >E[||a|| fe ] + -^}. 


Thus, 


p U{ 

UeT 


a: llallfe ll a llfe 


1 n i 

> ll«II e 2 b m + Vi}\< P{a: ||olife > E[||a||fe] + < e _ 2 - 2 <r) ; 


which immediately implies 


Pi max || uL (I a 




€2 VII “11*2 


-M>!) 


- 1 < e 8 ' (T). 


(A.l) 


Also 


Note that if 


2 

{ max ( g*u ) > w(T) + -} = p{ max ( g*u ) > E [max ( g*u ) 1 + -} < e . (A.2) 

V iteT 2 j V u€l~ L ueT J 2 ' 


max (| 11 1| <2 (|k||fe - H) + < 7 * 11 ) > g, 


then either 


max |u|k (||a|k -b m ) > 


or 


max ( g*u) > co(T) + 
ueT 2 
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This implies that 


{a,g : max (H| f2 |a||^ + g*u - b m H| <2 ) > w(T) + g] 

is a subset of 

{a,g : max ||u||^ 2 (|| a ||^ 2 - b m ) > 1J {«,9 : max {g*u) > w(T) + |}. 

Using the latter together with (A.l) and (A.2) and using the independence of a and <7 we have 

Pjmauc (||u|| <2 (||a|| fe -b m ) + g*u) > 77} <P{miH H| fe {\\a\\e 2 ~ b m ) > + p{ max (g*u) > u{T) + 

T 

<2e 8 - 2 < r >. (A.3) 

Now note that 


max Z u1) = max (||ti|L„ (||a|L - b m ) + g*u) . 
utT, ueS" -1 uiT 

This together with (A.3) implies 

2 2 
P{ max Z u , v > 77} < 2 e” 8 - 2 <r) => p| 1 J > b m \\u\\ e + g]\ < 2e~ 8 - 2 < r >. 

Now using Slepian’s second inequality (6.11) with g uv = b m + g we have 

2 

p{ U i x -u.,v > b m ||u|| fe +77]} < p{ (J [Y u , v >bm\\u\\ e2+V ]\<2e-^. 
ueT.ue S"- 1 “ J ueT.'ueS'*- 1 J 

Noting that 

P { max X u v > | u |L b m + 77 1 — ^ ] (_J [ X uv > b m || u | * + 77] 1 , 

I- U6 r. S"- 1 J -ueS"- 1 J 

concludes the proof. Next, we show that || Au\\ io > b m ||it||^ - w(T) - g. To accomplish this, we make use of 
Lemma 5.1 of [73]. The following is an immediate corollary. 

Corollary A.l Let A e R mxn ,g e R n ,h e R m be independent vectors with independent N (0,1) entries. 
Then, for any c € [R 

P(min max v* Au - b m ||n.||> c) > 2P(min max v*h\\u\\ i2 - u*g\\v\\ i2 - b m \\u\\ e2 > c). (A.4) 

The right-hand side can be simplified to min u 6 7 -(||/i ]|^ 2 - b m ) ||ii||^ -u*g, which is x/2er(T)-Lipschitz function 
of the vector \g* /i*]. As a result we have 

P(min(||/i ||^ 2 - b m ) \\u\\ t2 - u g > -u(T) ~g)>l~ exp(- ^J^ ). (A.5) 

The proof is complete by combining the latter with (A.4). 
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B Proof of Lemma 6.9 


To prove this lemma it suffices to show that g(t) := log ) is increasing when t > 0. To this aim note that 

9(0 -1 log 2 + log (r (—)) _ iog(r(|)) - i tog t. 


Thus, 


Now using the well known fact that 8 

Ht) : = 


1 (Eim Eliil !' 

9 21 r(!±i) r(|) t j 

8 

T'(i) _ r°°/ e - x e~ tx 
T(t) Jo \ x 1 - e _x 


dx, 


together with the change of variables y - x /2 we have 


3 1 - e 2 

e 2 -dx — 

1 - e 

1 - p - ^ I 

— V d y- — 

1 - e~ 2 y 2 1 


: -i) 


—— dy- - 
1 + e-y 2 1 


r 

Jo 

r 

Jo 

r~ oo 1 z-' oo 

: / - dy- / e~ 2tv dy 

Jo 1 + e _y Jo 

f 

Jo 


e 

1 - e~^ 
1 + e-« 


dy 


> 0 , 


where the last inequality holds since the integrand is non-negative when t > 0 and y > 0. This completes the 
proof as we have shown that g'(t ) is non-negative for t > 0 . 


3 For example see the wikipedia entry on the Digamma function (more specifically the part on integral representations). 
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